title: Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging
authors: Yan, Rui; Qu, Liangqiong; Wei, Qingyue; Huang, Shih-Cheng; Shen, Liyue; Rubin, Daniel; Xing, Lei; Zhou, Yuyin
date: 2022-05-17

The curation of large-scale medical datasets from multiple institutions, which is necessary for training deep learning models, is challenged by the difficulty of sharing patient data while preserving privacy. Federated learning (FL), a paradigm that enables privacy-protected collaborative learning among different institutions, is a promising solution to this challenge. However, FL generally suffers from performance deterioration due to heterogeneous data distributions across institutions and the lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Specifically, we introduce a novel distributed self-supervised pre-training paradigm into the existing FL pipeline (i.e., pre-training the models directly on the decentralized target task datasets). Built upon the recent success of Vision Transformers, we employ masked image encoding tasks for self-supervised pre-training to facilitate more effective knowledge transfer to downstream federated models. Extensive empirical results on simulated and real-world medical imaging federated datasets show that self-supervised pre-training largely benefits the robustness of federated models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared with the supervised baseline with ImageNet pre-training. Moreover, we show that our self-supervised FL algorithm generalizes well to out-of-distribution data and learns federated models more effectively in limited label scenarios, surpassing the supervised baseline by 10.36% and the semi-supervised FL method by 8.3% in test accuracy.

A federated model is trained on a much more diverse and larger-scale dataset than any single institution can provide, and thus usually exhibits superior performance and stronger generalizability. Therefore, this training paradigm has been adopted in many critical medical applications such as the detection of brain tumors [2] and COVID-19 [3], [4], and on various modalities including medical imaging data, electronic health records and sensor data [2], [5], [6]. As a decentralized approach, FL suffers from performance degradation due to data heterogeneity and label deficiency [7]-[9]. As shown in Fig. 1, data heterogeneity and label deficiency are particularly pronounced in medical image datasets in real-world applications. Regarding data heterogeneity, also referred to as statistical heterogeneity or non-identically distributed (non-IID) data partitions: some hospitals may have more data from patients at an early stage of disease while others may only collect data from patients with severe conditions (i.e., label distribution skew); large hospitals usually have more patient data than community clinics (i.e., quantity skew); and images at each hospital are acquired with different imaging acquisition protocols and on different patient populations (i.e., feature distribution skew).
In terms of label deficiency, some sites may not have enough bandwidth or incentive for complete, labor-intensive data labeling, and thus only a small proportion of the available medical images may be labeled. While several research efforts [10]-[13] have been devoted to addressing the challenges caused by data heterogeneity, current approaches tend to deteriorate in performance under strongly skewed data distributions [14]-[16]. To handle extremely non-IID data partitions, recent studies [16] suggest that Vision Transformers (ViTs) [17] are better alternatives to convolutional neural networks (CNNs), which have become the standard architecture used in the FL framework for image data. Qu et al. [16] reveal that simply replacing CNNs with ViTs outperforms even the state-of-the-art optimization-based FL methods. However, the success of such models largely relies on supervised ImageNet pre-training, which could suffer from domain discrepancy when fine-tuning with medical images and can be further improved by self-supervised pre-training on a centrally shared large-scale in-domain medical dataset [18]. However, such centrally shared datasets rarely exist in the medical domain due to privacy and ownership concerns. Therefore, it is desirable to build a self-supervised FL framework that collaboratively learns a global model by leveraging all available data without sharing data among institutions.

Another big challenge lies in the problem of label deficiency, which commonly exists in the field of medical imaging. Most existing FL algorithms assume that fully labeled samples are available [19]-[21], which is not always the case in practice. To fully leverage the available data without annotations, Yang et al. [22] combine semi-supervised learning approaches such as consistency loss [23] with FL, referred to as Semi-FL. However, for non-IID decentralized data, it remains unclear how to properly handle data heterogeneity under the federated semi-supervised learning setting.

In this paper, to address these challenges, we propose a robust self-supervised FL framework as shown in Fig. 2. To the best of our knowledge, this is the first work that simultaneously tackles the issues of data heterogeneity and label deficiency for medical imaging in FL. Self-supervised pre-training has been demonstrated to be an effective solution to alleviate the need for large-scale labeled pre-training datasets and potentially generalizes better across various tasks [24]. Moreover, unlike supervised learning, which relies heavily on label information, self-supervised pre-training learns the intrinsic features of images in local clients without labels; it therefore embodies less label-specific inductive bias and, as a result, is less susceptible to label distribution skewness. To this end, we design a distributed self-supervised learning paradigm to improve FL in medical imaging, which includes two essential steps: (1) self-supervised federated pre-training, which exploits knowledge from decentralized unlabeled data by pre-training self-supervised pretext tasks in a distributed setting; (2) supervised federated fine-tuning, which then transfers this knowledge to the target tasks by fine-tuning the federated models. Following [16], we choose ViT as our backbone network due to its superior ability to handle local data distribution shifts among different clients.
To fully leverage the potential of ViT for self-supervised FL, we use masked image encoding as the self-supervised pretext task, which has been proven effective for various natural image recognition tasks [25], [26]. Specifically, we implement and analyze two popular masked image encoding methods, BEiT [25] and MAE [26], as the self-supervision learning module in our federated framework. Although a few works [27], [28] have studied self-supervised pre-training methods for medical images, neither of these existing works investigates their performance in FL. We conduct extensive experiments under different degrees of data heterogeneity on diverse medical datasets, including retinal images, dermatology images and chest X-rays, to validate the broad effectiveness of our self-supervised FL framework. Our main contributions are summarized as follows:
• We propose a generalized self-supervised federated learning framework that employs masked image encoding as self-supervised tasks to learn efficient representations from medical imaging data.
• Extensive experiments under different data heterogeneity and with different ratios of labeled data verify that our framework is robust to non-IID data across clients, and is more label-efficient in comparison with the supervised baselines as well as existing FL algorithms.
• For evaluation on a real-world distribution, we construct a federated chest X-ray benchmark called COVID-FL by curating data from 8 different medical sites for testing the model's robustness in a realistic federated setting.

Federated learning (FL) [1] is a distributed training technique that trains machine learning models on private data across massively decentralized parties. FL faces the challenge of data heterogeneity in the distributions of training data across clients, which usually leads to weight divergence and non-guaranteed convergence issues [11], [29]. Many efforts have been devoted to solving the non-guaranteed convergence issues of FL, such as stabilizing the local model training on the parameter space [10], [11], [13], and improving the efficiency of global model aggregation [15], [30]-[33]. A few recent efforts leverage representation learning from unlabeled data to improve the performance of FL, a method known as federated self-supervised learning [8], [34], [35], with a focus on introducing contrastive learning methods such as SimCLR [36] and MoCo [37] into FL. These methods, however, are mainly designed for natural images, and their performance in medical imaging remains unclear. In addition, these methods, which use convolutional neural networks as the backbone, can suffer severe performance degradation when training data are strongly skewed among local clients [15], [16]. In our work, for the first time, we introduce a transformer-based pre-training protocol based on masked image encoding into the FL framework for medical imaging and demonstrate that our method maintains good performance even on highly heterogeneous data partitions.

Recently, Transformer architectures have achieved state-of-the-art performance on many vision tasks such as ImageNet classification and medical image segmentation [17], [38]. In addition to their compelling performance on vision and NLP tasks, they have also been shown to be robust to distribution shifts [16], [39] and hence could improve federated learning over heterogeneous data. In this work, Vision Transformer [17] is used as the backbone.
Self-supervised learning (SSL), a method that exploits unlabeled data by using the data itself to provide the supervision, has gained popularity because of its ability to learn more effective image representations [18] and avoid the cost of annotating large-scale datasets. The core of SSL lies in adopting pretext tasks as self-supervision and then using the learned representations for different downstream tasks. Various pretext tasks have been proposed for SSL, including image inpainting, jigsaw puzzles, etc. [40], [41], which have also been demonstrated to be beneficial for medical imaging tasks [42]. A more recent proposal is to use contrastive learning methods, which force the visual representations to be close to each other for similar pairs and far apart for dissimilar pairs [36], [37]. With the recent advance of Vision Transformers [17], multiple works [25], [26] have been proposed to learn image representations by signal reconstruction given corrupted images. We refer to this type of method as masked image encoding, which achieves competitive or even better results than contrastive learning methods.

Our work aims at building a robust model that collaboratively learns from decentralized clients without data sharing. Specifically, our goal is to improve the model performance in FL, especially for non-IID client data and in limited label scenarios. Suppose there are N clients. Each client k ∈ {1, ..., N} has a local dataset D_k. To learn a generalized global model over D = ∪_{k=1}^{N} D_k, the global objective function is defined as follows:

    \min_{w} L(w) = \sum_{k=1}^{N} \frac{|D_k|}{|D|} L_k(w),

and the local objective function L_k(w) in client k, measuring the local empirical loss over data distribution D_k, is defined as:

    L_k(w) = \frac{1}{|D_k|} \sum_{(x, y) \in D_k} \ell_k(w; x, y),

where \ell_k is the loss function used for client k, and w denotes the global model parameters to be learned. The focus of our work is to address the data heterogeneity issue in FL, given that the data across different clients are usually non-IID, i.e., D_m and D_n (m ≠ n) follow different distributions P_m(x, y) and P_n(x, y). Furthermore, considering that some local clients may not have sufficient labeled data due to the lack of resources, in this paper we also investigate how FL performs under limited annotation, i.e., the local dataset D_k comprises labeled data D_k^l = {(x, y)} and unlabeled data D_k^u = {x}, where |D_k^l| is relatively small.

Algorithm 1: Our generalized self-supervised FL framework. T is the maximum number of communication rounds, E is the number of local epochs.

To address this important problem, we propose a generalized self-supervised FL framework to enhance both the robustness and the performance of federated models when learning from decentralized data with statistical heterogeneity. Our framework comprises two stages: a self-supervised federated pre-training stage and a supervised federated fine-tuning stage, as shown in Fig. 2 and Alg. 1. During the self-supervised stage, the model exploits knowledge from decentralized data by pre-training self-supervised pretext tasks in a distributed setting. In the supervised federated fine-tuning stage, the knowledge is transferred from the previous stage to the target task by fine-tuning the federated models. In light of the recent success of Vision Transformers (ViTs), masked image encoding is employed for image representation learning during pre-training. Specifically, we integrate two popular self-supervised pre-training methods, BEiT [25] and MAE [26], into our generalized federated framework.
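To make the pre-training stage of Alg. 1 concrete, the following minimal sketch (not the authors' released code; the helper names `local_pretrain` and `client_loaders` are illustrative placeholders) shows one communication round: each client updates its masked-image auto-encoder locally for E epochs, and the server aggregates the resulting weights with the dataset-size weights |D_k|/|D| from the global objective above.

```python
import copy

def fedavg(state_dicts, sizes):
    """Dataset-size-weighted average of client model weights (|D_k|/|D| weights)."""
    total = float(sum(sizes))
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(state_dicts, sizes))
    return avg

def pretrain_round(global_model, client_loaders, local_pretrain, local_epochs):
    """One communication round of self-supervised federated pre-training.

    local_pretrain(model, loader, epochs) is assumed to run the client-side
    masked image encoding objective (BEiT- or MAE-style) and return the
    updated local model; it stands in for the local training loop.
    """
    states, sizes = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)            # broadcast w_t
        local_model = local_pretrain(local_model, loader, local_epochs)
        states.append(local_model.state_dict())              # upload w_{t,k}
        sizes.append(len(loader.dataset))
    global_model.load_state_dict(fedavg(states, sizes))      # aggregate w_{t+1}
    return global_model
```

Repeating this round T times and keeping the final global encoder yields the initialization used in the fine-tuning stage.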
We denote BEiT and MAE coupled with our framework as Fed-BEiT and Fed-MAE, respectively. The pre-training and fine-tuning details of Fed-BEiT and Fed-MAE are illustrated in Sec. III-C and Sec. III-D. During pre-training, the k-th local model consists of two components, an encoder E_k and an associated decoder D_k, which are trained via masked image encoding. Specifically, we mask a subset of image patches and then reconstruct the original signals in the masked patches. We implement two popular masked image encoding methods, BEiT [25] and MAE [26], as the self-supervision learning module in our federated framework.

Fig. 2. Overview of the self-supervised federated learning framework. In the pre-training stage (left), masked image encoding is employed as a self-supervised task to learn representations from unlabeled images in each client. A random subset of image patches is masked out and passed to an auto-encoder. The pre-training process mainly comprises three steps and terminates when it reaches the maximum number of communication rounds T. At round t, (1) each client k (k ∈ {1, ..., N}) trains its local auto-encoder (E_k and D_k) with the unlabeled local data; (2) client k uploads the weights of its auto-encoder w_{t,k} to the central server; (3) the server produces a global auto-encoder (E_G and D_G) with weights w_{t+1} via model weight averaging and broadcasts the global model back to each local client. In the fine-tuning stage (right), the pre-trained global encoder E*_G from the first stage is used as an initialization for each local ViT encoder E_k. A linear classifier L_k is appended to each local ViT encoder. End-to-end fine-tuning is performed on labeled images in each local client in an FL setting similar to the first stage.

In this section, we describe the main components of our proposed federated pre-training protocol. For the k-th client, an input image x ∈ R^{H×W×C} is reshaped into a sequence of image patches x_p ∈ R^{P×(S^2·C)}, where (H, W) is the dimension of the original image, C is the number of channels, (S, S) is the dimension of each image patch, and P = HW/S^2 is the number of image patches.

1) Masking: We denote the masking ratio as γ, the masked positions as M, and the unmasked positions as V. After randomly masking a proportion γ of the image patches, we have |M| = γP and |M| + |V| = P. The total set of image patches can be represented as x_p = x_p^M ∪ x_p^V, where x_p^M represents the masked patches and x_p^V represents the unmasked visible patches. Specifically, BEiT uses blockwise (n-gram) masking [25], and MAE uses random masking.

2) Encoder: We employ ViT [17] as our encoder and apply it to a sequence of image patches as shown in Fig. 2.
• For BEiT, the input to the ViT encoder is the full sequence of image patches, with the masked patches replaced by a shared learnable mask embedding.
• For MAE, the input to the ViT encoder is only the sequence of visible patches x_p^V.

3) Decoder: Our decoder performs the signal reconstruction given the encoded representations of the input patches.
• For BEiT, the inputs to the decoder are the encoded representations of all the patches {h_i}_{i=1}^{P} obtained from the last layer of the encoder. The decoder is a single linear layer that predicts the visual tokens at the masked positions {z_i, i ∈ M}, which are generated by the pre-trained dVAE tokenizer from DALL-E [43].
• For MAE, the inputs to the decoder are the encoded visible patches {h_i, i ∈ V}, along with a learnable vector for the masked patches e_p^M = {e_p^i, i ∈ M} and position embeddings. The decoder is a lightweight ViT that regresses the pixel values of the masked patches.
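As a concrete illustration of the masking step described above (a minimal sketch rather than the authors' implementation; the 0.6 default ratio follows the MAE setting and the patchified input shape is assumed), MAE-style random masking can be written as:

```python
import torch

def random_mask_patches(x_p, mask_ratio=0.6):
    """Randomly split a sequence of patches into visible and masked subsets.

    x_p: tensor of shape (B, P, S*S*C), i.e., patchified images.
    Returns the visible patches plus the index sets V (kept) and M (masked).
    """
    B, P, _ = x_p.shape
    num_masked = int(mask_ratio * P)            # |M| = gamma * P
    noise = torch.rand(B, P)                    # one random score per patch
    ids_shuffle = noise.argsort(dim=1)          # random permutation of patch ids
    ids_masked = ids_shuffle[:, :num_masked]    # M: positions to reconstruct
    ids_visible = ids_shuffle[:, num_masked:]   # V: positions fed to the encoder
    x_visible = torch.gather(
        x_p, 1, ids_visible.unsqueeze(-1).expand(-1, -1, x_p.size(-1)))
    return x_visible, ids_visible, ids_masked
```

BEiT's blockwise masking differs only in how the set M is chosen (contiguous blocks of patches instead of a uniform random subset); the encoder and decoder then operate on M and V as described above.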
The k-th local encoder E_k and decoder D_k are trained with their local data D_k to minimize the local loss ℓ_k:
• In BEiT, ℓ_k is the cross-entropy loss over the predicted visual tokens of the masked patches {z_i, i ∈ M}:

    \ell_k = - \sum_{i \in \mathcal{M}} \log p\left(z_i \mid x_p^{\mathcal{M}} \cup x_p^{\mathcal{V}}\right).

• In MAE, ℓ_k is the mean squared error between the predicted and original pixel values of the masked patches {x_p^i, i ∈ M}, with each patch x_p^i ∈ R^{S×S×C}:

    \ell_k = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{x}_p^i - x_p^i \right\|_2^2.

In the federated pre-training stage, each local client takes E steps of gradient descent to update the local model E_k and D_k by minimizing its local loss L_k on data D_k. Then, the server takes a weighted average of all the resulting local models to update the global model E_G and D_G, which is further sent back to the local clients for the next training iteration. The whole pre-training process terminates when we reach the maximum number of communication rounds T. We then save the final pre-trained global encoder E*_G for the fine-tuning stage.

During federated fine-tuning, as shown in Fig. 2, for the k-th client, we initialize the local encoder E_k with the pre-trained global encoder E*_G obtained from the first stage and append a linear classifier L_k on top of the encoder. We further fine-tune the whole model on the local labeled data D_k^l. Specifically, we use average pooling to extract the learned representations from the local encoder E_k. The learned representations are then fed to the linear classifier L_k (i.e., a softmax layer) to minimize the cross-entropy loss for image classification tasks.

In this section, we present experiments on a suite of simulated and real-world federated medical datasets to demonstrate the effectiveness of our methods. We first present the details of the datasets and experimental setup, and then evaluate the robustness of our methods on multiple diverse medical datasets. We further analyze the generalizability of the proposed method to out-of-distribution data and investigate its label efficiency by fine-tuning the model with different fractions of labels.

We consider three popular tasks in the medical imaging domain: (1) detecting diabetic retinopathy from retinal fundus images, (2) diagnosing skin lesions from dermatology images, and (3) identifying pneumonia and COVID-19 from chest X-rays. These tasks differ in terms of image modality, image acquisition, label distribution, etc. For example, retinal images are acquired by fundus cameras, dermatology images are captured by digital cameras, and chest X-rays are acquired via X-ray scanners. Visual differences among these three diverse medical datasets are demonstrated in Fig. 3.

Retina Dataset. We evaluate FL methods using retinal fundus images from the Kaggle Diabetic Retinopathy competition [44]. The dataset contains 35,126 images acquired from different cameras under varying exposures. The original images are divided into 5 categories (normal, mild, moderate, severe, and proliferating). We preprocess the dataset by binarizing the labels into Normal and Diseased, and randomly sample images as the training and test sets according to [16].

Dermatology Dataset. This dermatology dataset is denoted as Derm in our paper. Derm includes images from ISIC17 [45], ISIC19 [46] and ISIC20 [47]. We include ∼5,000 images of the Melanoma class (i.e., the malignant class) from these three datasets and randomly sample 5,000 images of the other, benign classes from ISIC19. The Derm dataset is then randomly split into training (∼7,500) and test (∼2,500) sets.

COVID-FL dataset.
For evaluation on a real-world distribution, we construct COVID-FL, a dataset in which each client only contains the data from one real-world site (i.e., hospital) without overlapping. Our COVID-FL dataset contains 20,018 chest X-ray scans from eight different publicly available data repositories. Each data site represents one medical institution to mimic real-world federated scenarios, and each site lacks one or more classes. For example, BIMCV only contains chest X-ray images of COVID-19 infections, while Guangzhou pediatric only includes images of normal and non-COVID-19 pneumonia patients (Fig. 6a). Data between sites are acquired by different machines and on different patient populations across the world, and are thus heterogeneous in intensity distribution (Fig. 6b). The COVID-FL dataset is further divided into an 80%-20% train-test split, yielding 16,044 train images and 3,974 test images. Each site also has the same partition ratio of train and test sets. The test set can be considered as a combination of the held-out data in each client (i.e., hospital).

Skin-FL dataset. The Skin-FL dataset contains skin lesion images and is constructed to evaluate the model generalization to out-of-distribution data. Following [56], after removing duplicates, the training set of Skin-FL consists of 22,888 images from four datasets, as shown in Fig. 7a. There are eight classes in total in the training set, including Actinic keratosis (AK), Benign keratosis (BKL) and Melanoma. Moreover, we use 33,126 images from ISIC20 [47] as our out-of-distribution test set to investigate how our proposed method generalizes to unseen clients. This test set contains several classes not included in the training set, so we binarize the predictions during fine-tuning as well as the labels of our test set into Benign and Malignant following [56]. This task is very challenging in light of its severe class imbalance (Fig. 7b).

We model IID and non-IID data distributions using a Dirichlet distribution following [31], [33], [60] for the Retina and Derm datasets. Compared with real federated data partitions, simulated data partitions allow a more flexible and thorough investigation of the model behavior by trying different degrees of data heterogeneity. Suppose a dataset has J classes; we randomly partition the data into N local clients by sampling, for each class j, a proportion vector p_j = (p_{j,1}, ..., p_{j,N}) ∼ Dir_N(α), where p_{j,k} ∈ (0, 1) and \sum_{k=1}^{N} p_{j,k} = 1. We assign a p_{j,k} proportion of the instances of class j to client k. The concentration parameter α in the Dirichlet distribution Dir(α) controls the degree of heterogeneity, and a smaller α leads to higher data heterogeneity. We simulate three sets of data partitions with α ∈ {100, 1.0, 0.5} for both the Retina and Derm datasets, where each data partition consists of N = 5 simulated clients (see Fig. 4 and 5). Based on the level of data heterogeneity, the 3 partitions are denoted as Split-1 (IID), Split-2 (moderate non-IID) and Split-3 (severe non-IID). COVID-FL and Skin-FL are two real-world federated datasets that contain both label distribution skewness and feature distribution skewness. Considering that certain clients contain much more data than others, we split those clients into non-overlapping sub-clients, resulting in 12 clients for COVID-FL and 10 clients for Skin-FL in total. The distributions of these two datasets are shown in Fig. 6a and 7a.
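A minimal sketch of this label-skewed Dirichlet partitioning scheme (illustrative only; the paper's exact sampling code is not given in the text, and the defaults below reflect the N = 5, α = 0.5 severe non-IID split) could look as follows:

```python
import numpy as np

def dirichlet_partition(labels, num_clients=5, alpha=0.5, seed=0):
    """Partition sample indices across clients with class proportions ~ Dir(alpha).

    Smaller alpha gives more skewed (non-IID) label distributions per client.
    Returns a list of index arrays, one per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for j in np.unique(labels):
        idx_j = rng.permutation(np.where(labels == j)[0])
        p_j = rng.dirichlet(alpha * np.ones(num_clients))    # p_{j,1}, ..., p_{j,N}
        # cumulative proportions -> split points for class j
        cuts = (np.cumsum(p_j)[:-1] * len(idx_j)).astype(int)
        for k, part in enumerate(np.split(idx_j, cuts)):
            client_indices[k].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]
```

With α = 100 the per-class proportions are nearly uniform (Split-1, IID), while α = 0.5 concentrates most of a class on a few clients (Split-3, severe non-IID).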
2) Data Augmentation: During pre-training, we randomly scale and crop patches of size 224 × 224 from the original images for all the datasets, followed by random color jitter and random horizontal flip. The random scaling factor is chosen from [0.

3) Self-supervised FL pre-training setup: All methods are implemented with PyTorch and deployed in a distributed training system using DistributedDataParallel (DDP). For Fed-BEiT and Fed-MAE, we adapt the official implementations of BEiT and MAE to an FL setting as shown in Fig. 2. ViT-B/16 [17] is selected as the backbone for the proposed models. Following the setup in BEiT [25] and MAE [26], the input is split into 14 × 14 image patches (with the same number of visual tokens) for BEiT and 16 × 16 patches for MAE. We randomly mask at most 40% of the total image patches for BEiT and 60% of those for MAE in our main experiments. AdamW with β_1 = 0.9, β_2 = 0.999 is employed for optimization. We use the same set of hyperparameters for centralized and federated learning in each task. The learning rate (η) and batch size (B) vary among different tasks. More details can be found in Table I. Fed-BEiT pre-training runs for 1000 communication rounds with a warmup of 10 epochs and cosine learning rate decay of 0.05; Fed-MAE pre-training runs for 1600 communication rounds with a warmup of 5 epochs and cosine learning rate decay of 0.05. For the pre-training schedule, we note that a larger number of training rounds generally brings more improvement, but the improvement gradually saturates.

After pre-training, we obtain the pre-trained federated encoder from the server and append a linear classifier to it for downstream image classification tasks. Fine-tuning is performed end-to-end collaboratively with labeled images in each client. The model is fine-tuned for 100 communication rounds with a learning rate of 3e-3 and a batch size of 256 for all the tasks, except COVID-FL, whose batch size is set to 64. We employ accuracy as our image classification evaluation metric for Retina, Derm and COVID-FL. We also calculate the AUC for the multi-class classification of lung infections in COVID-FL. For Skin-FL, following [56], we measure the F1-score considering its severe class imbalance.

To investigate the effectiveness of our method, we compare our self-supervised pre-trained FL method with other baseline approaches, including (1) training from scratch (ViT scratch) and (2) training supervised models with ImageNet pre-training (ViT ImageNet), both introduced in [16]. We also add two previously unexplored approaches: (3) training self-supervised models with ImageNet pre-training using BEiT [25] and MAE [26] (BEiT ImageNet and MAE ImageNet) as potentially stronger baselines. Our methods, i.e., Fed-BEiT and Fed-MAE, are pre-trained directly on the decentralized target task data in a distributed setting, whereas the other pre-training baselines for comparison are all pre-trained on a large-scale external dataset, ImageNet-22K (14 million images, 21,841 classes), at resolution 224 × 224 under centralized settings. The fine-tuning runs for 1000 communication rounds for the model trained from scratch and 100 communication rounds for the others. All methods use ViT-B/16 [17] as the backbone architecture for a fair comparison.

1) More robust to data heterogeneity: Data heterogeneity is a key challenge in FL that our method aims to address. Tables II-IV and Fig. 8 compare the performance under different degrees of statistical heterogeneity for multiple medical image classification tasks.
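For reference, a pre-training augmentation pipeline matching the description above might be assembled as follows. This is a hedged sketch, not the paper's configuration: the scale range and the color-jitter strengths are assumptions (the exact scaling range is truncated in the text), and only the transform types named above are taken from the paper.

```python
from torchvision import transforms

# Hypothetical pre-training augmentation: random resized crop to 224x224,
# color jitter, and horizontal flip. The scale range (0.2, 1.0) and jitter
# strengths are assumed values for illustration only.
pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(
        224, scale=(0.2, 1.0),
        interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```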
First, we observe that our proposed method is the only method that is robust to different non-IID degrees on all medical tasks, i.e., the discrepancy between its IID and non-IID performance is small. Although our methods achieve the strongest results under data heterogeneity, it is worth mentioning that they do not necessarily outperform the baselines under centralized and IID settings for certain tasks such as Derm. This could be explained by the fact that the domain shift between the pre-training dataset ImageNet-22K and the fine-tuning dataset Derm is relatively small, and thus ImageNet pre-training works exceptionally well in this case. Nonetheless, our method consistently improves model robustness under severe data heterogeneity for these diverse medical tasks.

We further investigate the performance of self-supervised models with ImageNet pre-training (i.e., BEiT ImageNet and MAE ImageNet). We consider these as potentially stronger baselines given that they outperform their supervised counterparts when fine-tuning on ImageNet in centralized learning [25], [26]. However, although they perform well under centralized settings as expected, they are vulnerable to non-IID data compared with their supervised counterparts and our methods, leading to a nontrivial model degradation when label distribution skewness among clients increases. This might be attributed to the large domain gap between ImageNet and the medical imaging data.

So far, we have shown that without ImageNet pre-training, our proposed method, which pre-trains directly on the target task data, can achieve comparable results under centralized settings on most tasks, and can outperform all ImageNet pre-training baselines in non-IID federated settings. This demonstrates the great potential of our framework to train a good-quality federated model in real-world medical applications, where the data distribution across hospitals is generally non-IID. The observed superior robustness of our self-supervised federated learning paradigm may benefit from its less label-specific inductive bias compared with the supervised baselines. Specifically, during self-supervised pre-training, since no label is given, local models learn the intrinsic features of the input images without bias from the labels, and thus are not as sensitive to the lack of certain classes as the supervised baselines.

Moreover, considering that there is a domain shift between ImageNet and the target medical task data, we conduct further experiments using a large external chest X-ray dataset, ChestX-ray14 (CXR14 [61]), in a setup similar to the ImageNet baselines, to study the model performance when a large centralized in-domain dataset is available. CXR14 consists of 112,120 chest X-ray images, which is 7× larger than the COVID-FL training set. According to Fig. 8c and Table IV, on the real-world federated split, the test accuracy improves by 0.43% using Fed-MAE but decreases by 3.19% using Fed-BEiT compared with the two self-supervised models with CXR14 pre-training (i.e., MAE CXR14 and BEiT CXR14). Both Fed-BEiT and Fed-MAE achieve better results than the supervised model ViT CXR14. This suggests that when large-scale centralized in-domain medical datasets exist, directly pre-training on them and fine-tuning with the downstream target task data could be a good alternative to our proposed method. Nonetheless, for many medical tasks, this kind of dataset rarely exists due to privacy and ownership concerns.

2) Generalization to out-of-distribution data: One desired property of a well-trained federated model is its generalization to out-of-distribution data.
We test the generalizability of our proposed methods using Skin-FL and compare them with the baselines stated in [56] and our ViT supervised baselines. From Table V, we find that Fed-BEiT and Fed-MAE perform marginally better than the supervised baseline with ImageNet pre-training and notably better than all other methods.

3) More label-efficient: We conduct further experiments to evaluate the model performance under limited label scenarios. This is done on Retina by reducing the number of labeled training images by different ratios during federated fine-tuning. Specifically, we take 2/3, 1/3 and 1/9 of the labels from each class such that the total number of labeled training images is reduced from 9,000 to 6,000, 3,000 and 1,000. Fig. 9 shows the test accuracy when fine-tuning with different numbers of labeled images. First, we observe that in both centralized and federated settings, our methods consistently improve the performance compared with the supervised baseline with ImageNet pre-training. In addition, the result shows that the model trained from scratch obtains unsatisfactory test accuracy when the labeled data is limited, e.g., less than 55% when the number of labeled images is 1K.

4) Comparison with prior FL methods: Two experiments are performed to study the robustness and label efficiency of our methods in comparison to prior FL methods. Specifically, we compare with (1) FedProx [11] to study the methods' robustness to severe non-IID data partitions (on Retina Split-3); and with (2) semi-supervised FL (Semi-FL [22]) to investigate the methods' effectiveness in limited label scenarios (on Retina Split-1 with 2/3 of the labels). We use ViT-B as the backbone architecture for all methods.

Fig. 10a shows the comparison of test accuracy in a severe non-IID case. To mitigate the weight divergence caused by data heterogeneity, FedProx [11] adds an L_2 regularization term \frac{\mu}{2} \| w - w^t \|^2 to the local objective function in Eq. 2 during local client updates. We observe that training with FedProx can improve accuracy by 1.60% over the FedAvg baseline after carefully tuning the optimization parameter µ (µ is set to 0.001). However, the gain of using our methods is significantly larger than that of using FedProx. In particular, Fed-MAE yields a gain of 13.3% in test accuracy. Note that the application of self-supervised pre-training in our method is orthogonal to optimization-based FL algorithms such as FedProx. Combining both could potentially boost the model performance even further. We leave this as future work.

Fig. 10b visualizes the test accuracy of different methods when training with 6K labeled data and 3K unlabeled data on Retina Split-1. To improve the model performance when labeled data is scarce, Semi-FL [22] leverages the unlabeled data together with the supervision from labeled data. It trains the clients with labeled data in a fully supervised manner for 400 epochs and then jointly trains with an additional client holding all the unlabeled data for another 400 epochs. This method was designed for segmentation tasks, and we adapt it to classification tasks. For the unlabeled client, we use a consistency loss function based on data augmentation, where we apply the cross-entropy loss between the outputs of the augmented data and the pseudo labels based on the predictions of the original data. According to Fig. 10b, our self-supervised FL method with Fed-MAE outperforms the supervised baseline by 10.36% and the semi-supervised method by 8.3%.
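The FedProx modification used in this comparison amounts to adding a proximal penalty to each client's local loss during local updates. Below is a minimal sketch (illustrative only; `task_loss` and the frozen global-weight snapshot are assumed names, with µ = 0.001 as stated above).

```python
def fedprox_local_loss(model, global_params, task_loss, mu=0.001):
    """Local FedProx objective: task loss + (mu / 2) * ||w - w_t||^2.

    global_params: a frozen copy of the global parameters w_t received at the
    start of the communication round; task_loss is the client's usual loss.
    """
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    return task_loss + 0.5 * mu * prox
```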
We conduct ablation studies to examine the impact of the masking ratio and data augmentation during pre-training. All experiments are conducted on Retina under centralized settings.

1) Masking ratio: As shown in Table VI, the optimal masking ratio is 40% for BEiT and 60% for MAE. This is in accordance with the observations on ImageNet [25], [26]. However, the optimal masking ratio may vary among different tasks.

2) Data Augmentation: We investigate the effect of data augmentation on MAE pre-training in Table VII. Adding gray scaling and color jittering improves the accuracy by 0.7%. This indicates that task-specific data augmentations may help with pre-training on medical tasks.

In this paper, we present a robust self-supervised federated learning framework which exploits masked image encoding as its self-supervised task to collaboratively train a model on the target task data. In particular, we introduce two popular masked image encoding methods, BEiT and MAE, as the self-supervision learning module. Our extensive experiments show that our framework is robust to non-IID data distributions across clients, and especially achieves significant improvement over state-of-the-art ImageNet supervised pre-training baselines under severe data heterogeneity. The robustness of our proposed framework may benefit from the less label-specific inductive bias during self-supervised pre-training. Moreover, we observe that our framework generalizes well to out-of-distribution data and outperforms both the supervised and semi-supervised FL methods in terms of label efficiency.

References:
Communication-efficient learning of deep networks from decentralized data
Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data
Dynamic-fusion-based federated learning for COVID-19 detection
Federated learning for COVID-19 screening from chest X-ray images
End-to-end privacy preserving deep learning on multi-institutional medical imaging
The future of digital health with federated learning
Federated learning: Challenges, methods, and future directions
SSFL: Tackling label deficiency in federated learning via personalized self-supervision
Group knowledge transfer: Federated learning of large CNNs at the edge
On the convergence of FedAvg on non-IID data
Federated optimization in heterogeneous networks
Tackling the objective inconsistency problem in heterogeneous federated optimization
SCAFFOLD: Stochastic controlled averaging for federated learning
Model-contrastive federated learning
Federated learning with non-IID data
Rethinking architecture design for tackling data heterogeneity in federated learning
An image is worth 16x16 words: Transformers for image recognition at scale
Big self-supervised models advance medical image classification
Towards efficient and privacy-preserving federated deep learning
Towards federated learning at scale: System design
LEAF: A benchmark for federated settings
Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan
MixMatch: A holistic approach to semi-supervised learning
Using self-supervised learning can improve model robustness and uncertainty
BEiT: BERT pre-training of image transformers
Masked autoencoders are scalable vision learners
Self pre-training with masked autoencoders for medical image analysis
Self-distillation augmented masked autoencoders for histopathological image classification
The non-IID data quagmire of decentralized machine learning
Federated learning with matched averaging
Bayesian nonparametric federated learning of neural networks
Ensemble distillation for robust model fusion in federated learning
Data-free knowledge distillation for heterogeneous federated learning
Divergence-aware federated self-supervised learning
Federated unsupervised representation learning
A simple framework for contrastive learning of visual representations
Momentum contrast for unsupervised visual representation learning
TransUNet: Transformers make strong encoders for medical image segmentation
Intriguing properties of vision transformers
Context encoders: Feature learning by inpainting
Unsupervised learning of visual representations by solving jigsaw puzzles
Models Genesis
Zero-shot text-to-image generation
Diabetic Retinopathy Detection
Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC)
BCN20000: Dermoscopic lesions in the wild
A patient-centric dataset of images and metadata for identifying melanomas using clinical context
BIMCV COVID-19+: A large annotated dataset of RX and CT images from COVID-19 patients
COVID-19 Image Repository, figshare dataset
SIRM COVID-19 database
Data from Medical Imaging Data Resource Center (MIDRC) - RSNA International COVID Radiology Database (RICORD) Release 1c - Chest X-ray, COVID+ (MIDRC-RICORD-1c), The Cancer Imaging Archive
Kaggle RSNA Pneumonia Detection Challenge
Identifying medical diagnoses and treatable diseases by image-based deep learning
FedPerl: Semi-supervised peer learning for skin lesion classification
Seven-point checklist and skin lesion classification using multitask multimodal neural nets
The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions
The impact of patient clinical information on automated skin cancer detection
Measuring the effects of non-identical data distribution for federated visual classification
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Federated semi-supervised learning with inter-client consistency & disjoint learning
Semi-supervised federated peer learning for skin lesion classification