key: cord-0494326-i2r1zi1p
authors: Jiménez-Sánchez, Amelia; Tardy, Mickael; Ballester, Miguel A. González; Mateus, Diana; Piella, Gemma
title: Memory-aware curriculum federated learning for breast cancer classification
date: 2021-07-06
journal: nan
DOI: nan
sha: 8213112b6399ecde29d8e791a4b07a59832841ac
doc_id: 494326
cord_uid: i2r1zi1p

For early breast cancer detection, regular screening with mammography imaging is recommended. Routine examinations result in datasets with a predominant amount of negative samples. A potential solution to such class imbalance is joining forces across multiple institutions. Developing a collaborative computer-aided diagnosis system is challenging in different ways. Patient privacy and regulations need to be carefully respected. Data across institutions may be acquired from different devices or imaging protocols, leading to heterogeneous non-IID data. Also, for learning-based methods, new optimization strategies working on distributed data are required. Recently, federated learning has emerged as an effective tool for collaborative learning. In this setting, local models perform computation on their private data to update the global model. The order and the frequency of local updates influence the final global model. Hence, the order in which samples are locally presented to the optimizers plays an important role. In this work, we define a memory-aware curriculum learning method for the federated setting. Our curriculum controls the order of the training samples, paying special attention to those that are forgotten after the deployment of the global model. Our approach is combined with unsupervised domain adaptation to deal with domain shift while preserving data privacy. We evaluate our method with three clinical datasets from different vendors. Our results verify the effectiveness of federated adversarial learning for multi-site breast cancer classification. Moreover, we show that our proposed memory-aware curriculum method further improves classification performance. Our code is publicly available at: https://github.com/ameliajimenez/curriculum-federated-learning.

Breast cancer is the most commonly occurring type of cancer worldwide for women [1]. Early detection and diagnosis of breast cancer are essential to decrease its associated mortality rate. The medical community recommends regular screening with X-ray mammography imaging for its early detection and follow-up. High-resolution images showing tissue details need to be analyzed to spot abnormalities and to provide a precise diagnosis. Despite the high incidence (i.e., 12%) [1], extensive breast cancer screening results predominantly in negative samples. Such class imbalance can be problematic for learning-based Computer-Aided Diagnosis (CAD) systems. A potential solution to mitigate the existing class imbalance and to increase the size of the annotated dataset is to employ data coming from multiple institutions. However, sharing medical information across (international) institutions is challenging in terms of privacy, technical, and legal issues. Secure and privacy-preserving machine learning offers an opportunity to reconcile patient data protection with data usage for research and clinical routine purposes. Federated Learning (FL) aims to train a machine learning algorithm across multiple decentralized nodes holding the data samples locally, i.e., without exchanging them.
Training such a decentralized model in a FL setup presents three main challenges: (i) system and statistical heterogeneity, (ii) data protection, and (iii) distributed optimization. We deal with these three challenges for breast cancer classification in the context of FL. The first challenge concerns system and data heterogeneity. For the same imaging modality, different system vendors produce images following significantly different intensity profiles. To cope with such diversity, recent works [2], [3] have proposed to integrate Unsupervised Domain Adaptation (UDA) into the FL framework. UDA methods force the model to learn domain-agnostic features through adversarial learning [2] or a specific type of batch normalization [3]. In this work, we follow a UDA adversarial approach to handle non-IID data. To address the second challenge, data protection, cryptographic techniques [4] or differential privacy [5], [6] are employed. Differential privacy perturbs each local model's parameters by purposely adding noise before uploading them to the server for aggregation. We leverage differential privacy for data protection in our method. The third challenge concerns the distributed optimization in the FL setting. Individual models are trained locally on private data, and the central server is responsible for the global aggregation of the local updates. Usually, the communication of the local models to the server occurs a certain number of times every epoch. Since the order in which samples are presented to the local optimizers influences these updates, we propose a novel curriculum learning approach that provides a meaningful order to the samples. Contributions: In this work, we investigate for the first time the use of Curriculum Learning (CL) [7] in FL to boost the classification performance while improving domain alignment. Our CL approach is implemented via a data scheduler, which establishes a prioritization of the training samples. We assign higher importance to samples that are forgotten after the deployment of the global model. We show that presenting the training samples in this order is beneficial for FL, and also boosts the domain alignment between domain pairs. Similar to [8], we employ federated adversarial learning [2], [9] to deal with the alignment between the different domains. However, unlike Li et al. [8], who analyze 1-D signals extracted from f-MRI, we study the screening of high-resolution mammograms and use CL to boost the classification performance. We validate our strategy on a setup composed of one public and two private clinical datasets with non-IID intensity distributions. Different from [10], which proposes a FL framework for breast density classification and does not correct the misalignment between the domains, we target the more complex task of breast cancer classification. Furthermore, we propose a novel curriculum for the FL setting, and explicitly handle domain shift with federated adversarial domain adaptation. FL arises from the need to share sensitive medical data between different healthcare providers. FL has been mainly formulated in two ways: (i) with differential privacy [5], [6], i.e., each site trains a local model with private data and only shares model parameters [11], and (ii) by protecting the details of the data using cryptographic techniques [4], such as secure multi-party computation [12] and homomorphic encryption [13]. We focus on the differential privacy approach. Only a few FL works have been shown to be effective on medical images.
For instance, for brain tumor segmentation [14]-[16]; for prediction of disease incidence, patient response to treatment, and other healthcare events [17]; and lately for classification [8], [18]-[21]. Regarding breast imaging, only Roth et al. [10] have investigated breast density classification. As in [10], we employ a client-server-based FL method with Federated Averaging (FedAvg) [22], which combines local Stochastic Gradient Descent (SGD) on each site with a server that performs model averaging. However, [10] significantly down-sampled the input mammograms. Although low resolutions are acceptable for density classification, the loss of detail penalizes the malignancy classification task. Moreover, [10] did not apply any domain adaptation technique to compensate for the domain shift of the different pixel-intensity distributions. Here, we opt for a different approach by working on high-resolution mammograms with federated domain adversarial learning [2]. Deep learning methods assume that samples from the training (source) and testing (target) sets are IID data. However, this statement does not always hold. When the data distributions of the source and target domains are related but different, there is a domain shift. Domain Adaptation (DA) aims to remove such shifts by transferring the learned representation from a source to a target domain. When target labels are unavailable during the training phase, UDA techniques are employed. One of the UDA strategies is to learn a domain-invariant feature extractor, which aligns the feature distribution of the target domain to that of the source by: (i) minimizing a distance of domain discrepancy [23], (ii) revisiting batch normalization layers [24], or (iii) through adversarial learning [25]. Despite their reduced annotation requirements, the above UDA approaches need access to both source and target data [26], [27]. However, in the federated setting, data is stored locally and cannot be shared. Recently, federated batch normalization [3] and federated adversarial domain adaptation [2], [9] have been proposed to deal with DA under the privacy-preserving requirement.

Fig. 2: The curriculum data scheduler rearranges the training samples to prioritize samples that were forgotten after the deployment of the global model.

The work by Li et al. [3] addresses the feature shift, i.e., the deviation in feature space, using batch normalization before averaging the local models, whereas Peng et al. [2] train in an adversarial manner a feature extractor and a domain discriminator to learn a domain-invariant representation and alleviate domain shift. The latter has been applied to f-MRI on 1-D signal data using a multi-layer perceptron [8]. Different from the work by Li et al. [8], we study federated adversarial alignment using a deep convolutional neural network (ResNet-22) on medical images, in particular, high-resolution mammograms for breast cancer classification. CL [7] is inspired by the "starting small" concept from cognitive science. CL methods follow a systematic and gradual way of learning. A scoring function is defined to determine the priority of the training samples. Based on this scoring function, which can measure, for example, difficulty or uncertainty, the training samples are weighted or presented in a certain order to the optimizer. This new order has an impact on the local minimum achieved by the optimizer, leading to an improvement in the classification accuracy.
CL has already demonstrated improved performance in medical image classification tasks, such as thoracic disease [28], skin disease [29], proximal femur fractures [30], [31] and breast screening classification [32]. These techniques exploit either attention mechanisms [28], meta-learning [32], prior knowledge [29], [30] or uncertainty in the model's predictions [31]. There is little prior work on CL in combination with DA techniques for general classification. Mancini et al. [33] investigated a combination of CL and Mixup [34] for recognizing unseen visual concepts in unseen domains. Shu et al. [35] addressed two entangled challenges of weakly-supervised DA: sample noise of the source domain, and distribution shift across domains. An extreme case of DA is that of zero-shot learning, in which, at test time, a learner observes samples from classes that were not observed during training. Tang et al. [36] proposed an adversarial agent, referred to as curriculum manager, which learns a dynamic curriculum for source samples. Different from [33]-[36], which aim at improving transferability between domains, we choose to schedule the data within each domain. We design local data schedulers aiming to improve the consistency between global and local models and prevent forgetting samples that were previously correctly classified by the local model. To this end, we monitor the training samples before and after the deployment of the global model. We define a scoring function that assigns high values to samples that have been forgotten by the local model. Thus, our CL method builds memory-aware data schedulers locally to avoid forgetting. In this section, we formulate the details of our proposed curriculum approach to locally schedule training samples in the FL setting. The overall FL framework is depicted in Fig. 2. In this setting, we assume that each local site has data storage, a computing server and a memory-aware CL module. Nevertheless, at the global level, no imaging data are stored and only computing is possible. In this type of FL setting, it is common to share the model weights and aggregate them at the central server. Moreover, local healthcare providers may have diverse imaging systems, resulting in datasets with different intensity profiles. To mitigate the existing domain shift between the sites, we deploy a UDA strategy that shares the latent representations (and not the image data) between domain pairs. Both the model weights and the embeddings are blurred with Gaussian noise [8] to protect the private data using differential privacy [5], [6]. The memory-aware CL module compares the local and global model predictions and assigns scores to each training sample. The data scheduler leverages the curriculum probabilities to locally arrange the samples. In Subsection III-A, the overall FL framework is presented. In Subsection III-B, we present the details of the FL setup with its data privacy-preserving scheme. Then, in Subsection III-C, we introduce DA into the framework. Finally, in Subsection III-D, we present the details of our proposed method leveraging CL to avoid forgetting locally learned samples in the FL setting. Next, we develop our method to learn a collaborative CAD system in a decentralized multi-site scenario with a privacy-preserving strategy. Let us denote each site's dataset as $D_n$, where $n = 1, \dots, N$ and $N$ is the total number of sites. Each dataset is composed of mammography images $X_n$ and their corresponding diagnoses $Y_n$, i.e., $D_n = \{X_n, Y_n\}$.
We aim to detect malignant cases by training a deep-learning model. We formulate the learning objective as a binary classification task, where malignant samples correspond to the positive class. Each local model aims to minimize the cross-entropy loss over the training data from a particular site n:

$$\mathcal{L}_{Cls}^{n} = -\frac{1}{|Y_n|} \sum_{k=1}^{|Y_n|} \left[ y_k^n \log p_k^n + \left(1 - y_k^n\right) \log\left(1 - p_k^n\right) \right],$$

where $y_k^n$ is the label of the k-th subject in the training label set $Y_n = \{y_1^n, \dots, y_{|Y_n|}^n\}$ and $p_k^n$ is the corresponding output probability of the model for an input $x_k^n \in X_n$. As depicted in Fig. 3 (left), we split the deep learning model into a feature extractor F and a classifier Cls. We refer to the output of the feature extractor as the latent representation or embedding. In this work, we assume the most challenging scenario, in which each site has mammography systems from a different vendor (see Fig. 1). We assume that data owners collaboratively train a global model without sharing their image data. The term federated was coined because the learning task is solved by a federation of participating models (frequently referred to as clients), which are coordinated by a central server. The FL scenario is depicted in Fig. 2. We assume that each local site has data storage and a computing node. Nevertheless, at the global level, only computing is possible. Once the individual models have been trained on private data, there are four key steps in the FL training process: (1) local updates are sent to the global server with privacy protection or encryption, (2) the central server aggregates the local updates, (3) the aggregated model parameters are deployed to the local sites, and (4) local models are updated. After that, a new round of local training starts. To apply SGD in the federated setting, each client n computes gradients on the full local data for the current model, and the central server performs the aggregation of these weights to build a global update. Let us assume a fixed learning rate $\eta$ and denote the gradients at each client as $g_n$. The central server computes the update as

$$w_{t+1} \leftarrow w_t - \eta \sum_{n=1}^{N} \frac{m_n}{M} g_n,$$

where $m_n$ is the number of images at site n, and M the total number of images. We refer to this algorithm as FedSGD. We can decompose the global update into local client ones: first, one takes a gradient descent step from the current model using each local dataset, $\forall n: w_{t+1}^n \leftarrow w_t - \eta\, g_n$. Then, we let the server make a weighted average of the resulting local updates as $w_{t+1} \leftarrow \sum_{n=1}^{N} \frac{m_n}{M} w_{t+1}^n$. Instead of performing one global update after each local computation, we can add multiple iterations of the local update to each client before the averaging step. Model updates are performed at every communication round. Let us denote: Q, the total number of optimization iterations; $\tau$, the communication pace; and B, the local mini-batch size used for the client updates. In each epoch, the communication between the models happens $Q/\tau$ times. Federated averaging (FedAvg) [22] is a generalization of FedSGD, which allows local nodes to perform more than one batch update on local data and exchanges the updated weights rather than the gradients. We build on top of FedAvg to further consider domain alignment. Medical images collected from different healthcare providers may originate from diverse devices or imaging protocols, leading to non-IID pixel-intensity distributions. In this scenario, we try to compensate for the domain shift between every pair of domains. There is extensive literature on UDA methods [27], [37], [38].
However, these works do not generally satisfy the conditions of a FL setting: namely, that data should be stored locally and not shared. To satisfy the requirements of the FL framework and to address the domain shift problem, we rely on federated adversarial alignment [2]. This method aligns the feature space by progressively reducing the domain shift between every pair of sites. To preserve privacy, only the noisy latent representations (Gaussian noise is added to each local latent representation) are shared between the sites at every communication round. This method leverages a domain-specific local feature extractor F and a global discriminator D. For source $D_S$ and target $D_T$ sites, we train individual local feature extractors $F_S$ and $F_T$, respectively. For each $(D_S, D_T)$ source-target domain pair, we train a domain discriminator D to align the distributions. Optimization takes place in two iterative steps. In the first, the objective for discriminating the source domain from the others is defined as:

$$\mathcal{L}_{adv_D} = -\,\mathbb{E}_{x^S \sim D_S}\left[\log D\left(Z\left(F_S(x^S)\right)\right)\right] - \mathbb{E}_{x^T \sim D_T}\left[\log\left(1 - D\left(Z\left(F_T(x^T)\right)\right)\right)\right],$$

where Z(·) is the Gaussian noise generator for privacy preservation. In the second step, we consider the adversarial feature extractor loss:

$$\mathcal{L}_{adv_F} = -\,\mathbb{E}_{x^T \sim D_T}\left[\log D\left(Z\left(F_T(x^T)\right)\right)\right].$$

The weights of the feature extractor F and the domain discriminator D remain unchanged during the first and second steps, respectively. We propose to incorporate CL to improve the classification performance of the federated adversarial learning approach. In particular, the curriculum is implemented in the form of a data scheduler. A data scheduler is a mechanism that controls the order and pace of the training samples presented to the optimizer. We follow our previous work [31] and tailor it to the federated setting. In the following, we introduce the required components to define our CL method. We formalize the definition of the data scheduler through three components: a scoring function $\rho$, curriculum probabilities $\gamma$, and a permutation function $\pi$, and provide further details in the next paragraph. The key element of our approach is the scoring function $\rho$, which is specific to FL. The scoring function $\rho$ assigns a score to every sample, which, normalized, becomes a curriculum probability $\gamma$. These probabilities are then used to sample the training set $\{X, Y\}$. The sampling operation establishes a permutation $\pi$ determining the reordered dataset $\{X_\pi, Y_\pi\}$, which is finally fed in mini-batches to the optimizer. We consider a dynamic approach in which the scoring values are computed at every epoch e for every training sample k. We get the predictions at every site n, before and after the communication between the models, obtaining local and global predictions $\hat{y}^L$ and $\hat{y}^G$, respectively. To avoid forgetting in the FL setting, our scoring function $\rho$ assigns higher values (and thus higher curriculum probabilities $\gamma$) to samples that were forgotten. The order in which samples are presented to the optimizer is determined by the curriculum probabilities $\gamma$. Our function assigns a higher score $\rho_+$ to forgotten samples and a lower base score $\rho_-$ to the rest:

$$\rho\left(x_k^n\right) = \begin{cases} \rho_+ & \text{if } \hat{y}_k^L = y_k^n \text{ and } \hat{y}_k^G \neq y_k^n, \\ \rho_- & \text{otherwise,} \end{cases} \qquad \rho_+ > \rho_- > 0.$$

We emphasize learning of samples for which the prediction changed from correct to wrong after the model aggregation. Our memory-aware curriculum federated learning method is summarized in Algorithm 1 (Suppl. Material). In order to validate the effect of data scheduling on breast cancer classification, we perform experiments with two private datasets and one public dataset. We compare our proposed approach combining FL, DA and CL against FL alone and FL with DA.
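Before turning to the experiments, the main components described in this section can be illustrated with short Python sketches. First, the FedAvg aggregation of Subsection III-B: a minimal illustration that treats each client's parameters as a single array (in practice every tensor of the model state is averaged the same way); the helper name fed_avg is ours, not the paper's.

import numpy as np

def fed_avg(client_weights, client_sizes):
    """Server-side FedAvg: w_{t+1} = sum_n (m_n / M) * w^n_{t+1}."""
    total = float(sum(client_sizes))  # M, the total number of images
    return sum((m / total) * w for w, m in zip(client_weights, client_sizes))

# Example: three sites holding m_n = 100, 50 and 25 local images.
w_global = fed_avg([np.ones(4), 2 * np.ones(4), 4 * np.ones(4)], [100, 50, 25])

Sites holding more data thus pull the global model proportionally harder, exactly as in the weighted average above.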
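Second, the two-step federated adversarial alignment of Subsection III-C. The following PyTorch-style sketch is our illustration rather than the authors' code; the function names are assumptions, and the default noise variance follows the σ² = 0.001 reported later in the experimental setup.

import torch
import torch.nn.functional as F

def add_noise(z, sigma2=0.001):
    # Z(.): Gaussian perturbation of the latent representation for privacy.
    return z + torch.randn_like(z) * sigma2 ** 0.5

def discriminator_step(feat_src, feat_tgt, D):
    # Step 1: update D to separate source from target embeddings;
    # the feature extractors stay fixed (detach).
    p_src = D(add_noise(feat_src.detach()))
    p_tgt = D(add_noise(feat_tgt.detach()))
    return F.binary_cross_entropy(p_src, torch.ones_like(p_src)) + \
           F.binary_cross_entropy(p_tgt, torch.zeros_like(p_tgt))

def feature_step(feat_tgt, D):
    # Step 2: update the target feature extractor to fool the frozen D.
    p_tgt = D(add_noise(feat_tgt))
    return F.binary_cross_entropy(p_tgt, torch.ones_like(p_tgt))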
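Finally, the memory-aware data scheduler of Subsection III-D: a sketch of the scoring function ρ, the curriculum probabilities γ, and the permutation π. The boost value and the function name are our assumptions.

import numpy as np

def curriculum_permutation(y_true, y_local, y_global, boost=2.0, seed=None):
    """Reorder the local training indices, prioritizing forgotten samples.

    A sample is 'forgotten' when the local model classified it correctly
    before aggregation (y_local == y_true) but the deployed global model
    does not (y_global != y_true). rho scores such samples higher; the
    normalized scores gamma drive the sampling that defines pi.
    """
    rng = np.random.default_rng(seed)
    y_true, y_local, y_global = map(np.asarray, (y_true, y_local, y_global))
    forgotten = (y_local == y_true) & (y_global != y_true)
    rho = np.where(forgotten, boost, 1.0)   # scoring function rho
    gamma = rho / rho.sum()                 # curriculum probabilities gamma
    # pi: sample all indices without replacement, biased towards high gamma
    return rng.choice(len(y_true), size=len(y_true), replace=False, p=gamma)

The returned permutation indexes the local training set, yielding the reordered $\{X_\pi, Y_\pi\}$ fed in mini-batches during the next round.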
For our study, we employ three datasets of Full-Field Digital Mammography (FFDM), coming from three different vendors: Hologic, GE and Siemens (INbreast [39]). The first two are private clinical datasets, and the last one is publicly available. Institutional board approvals were obtained for each of the datasets. Intensity profiles [40] vary significantly among the datasets, as can be observed in Fig. 1. This variability is mainly due to the different mammography systems and acquisition protocols used to generate digital mammograms. We do not use any site-specific image filtering to compensate for the domain shift, and we apply the same preprocessing to the images from the three sites. The preprocessing consists of standard normalization with mean subtraction and division by the standard deviation. Each dataset was split into three parts with an approximate ratio of 70%:10%:20% to build the training, validation and test sets, respectively. Our problem is formulated as a binary classification task. The number of samples per class and database can be found in Table I. The first class groups benign findings and normal cases; the second class contains only malignant cases confirmed by biopsy. Mammography images are of different sizes: we cropped the empty rows and columns, resized the images to 2048 pixels in height, and then padded them to 2048 pixels in width. It is often the case that important cues for diagnosis are subtle findings in the image, which could be as small as 10 pixels in length [41]. Therefore, we do not apply any further downsampling and use a resolution of 2048 pixels, close to the original resolution. We perform an in-depth evaluation of our proposed method Fed-Align-CL with a series of experiments. First, we investigate the effect of different pretraining strategies in the FL framework. Second, we compare the classification performance of our approach against other non-federated and federated approaches. Third, we investigate the influence of DA and CL on the resulting feature embeddings of the different methods. Architectures: We employ as feature extractor F the architecture proposed by Wu et al. [42], a ResNet-22 [43] that is adapted to take high-resolution images (∼4 megapixels) as input. We initialize the feature extractor with the pretrained weights provided by Wu et al. [42]. The weights of the classifier Cls and domain discriminator D are randomly initialized. The classifier Cls is formed by three fully connected layers. The first two are followed by batch normalization, ReLU activation, and dropout. The domain discriminator D is formed by two fully connected layers with a ReLU activation in between and a sigmoid layer for the final output. Details of the architecture of the models can be found in Table IV (Suppl. Material). Unlike [8], which employs a multi-layer perceptron for 1-D f-MRI signals, we deploy a specific CNN for high-resolution mammography images. Hyperparameters: We train our models five times with different seed initializations for the classifier Cls and domain discriminator D. Adam optimization is used for 50 epochs with an initial learning rate of 1e-5. We compute the adversarial domain loss $\mathcal{L}_{adv_D}$, and also introduce the curriculum data scheduling, after training the feature extractor F and classifier Cls for 5 epochs. The dropout rate for the classifier Cls is set to 0.5. The number of optimization iterations is Q = 120, and the local batch size is $B_n = m_n / Q$. In each epoch, local models are updated according to the communication pace $\tau$.
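The image preprocessing described above amounts to something like the following sketch; the zero-background cropping criterion and the use of skimage's resize are our assumptions.

import numpy as np
from skimage.transform import resize

def preprocess(img, side=2048):
    """Crop empty borders, resize to 2048 px height, pad width, normalize."""
    rows = np.any(img > 0, axis=1)          # rows containing breast tissue
    cols = np.any(img > 0, axis=0)
    img = img[rows][:, cols]                # drop empty rows and columns
    scale = side / img.shape[0]
    new_w = max(1, round(img.shape[1] * scale))
    img = resize(img, (side, new_w), preserve_range=True)
    if new_w < side:                        # mammograms are taller than wide,
        img = np.pad(img, ((0, 0), (0, side - new_w)))  # so right-pad the width
    return (img - img.mean()) / (img.std() + 1e-8)      # standard normalization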
The shared weights are modified by the addition of random noise to protect the data from leakage by model inversion. We generated Gaussian noise $\sim \mathcal{N}(0, s_h^2 \sigma^2)$, assuming a sensitivity $s_h = 1$ and a variance $\sigma^2 = 0.001$. We investigated different communication paces $\tau = \{10, 20, 40, 60\}$ and noise values $\sigma^2 = \{0, 0.001, 0.01, 0.1\}$. We did not find significant differences in classification accuracy for the different communication paces $\tau$. Since model performance degrades as more noise is introduced into the system, we consider that adding noise with $\sigma^2 = 0.001$ is a good trade-off. Evaluation metrics: For the classification task, we report the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC). a) Initialization of local models: First of all, we investigate the classification performance of the FL method with different pretraining strategies. The first case (Local model) corresponds to pretraining each model with its own private data. The second case (Scratch) corresponds to a random initialization of the local model weights. The third case (DDSM) corresponds to pretraining the models on the CBIS-DDSM dataset [44]. The last case corresponds to initializing the models with the publicly shared weights from Wu et al. [42]. In Table II, the AUC for the different initialization strategies is reported. We found that the best approach was using the pretrained weights from Wu et al. [42]. This behaviour is expected because their model was trained with a very large private dataset. Moreover, the model in [42] was already pretrained on the ImageNet [45] dataset. Interestingly as well, classification results were better when all local models were initialized either randomly or pretrained on a single dataset (DDSM) than when each of them was pretrained on a small private dataset. Although the DDSM dataset is large, it consists of screen-film mammography instead of FFDM, which explains the gap with respect to Wu et al.'s weights. b) Comparison with different strategies: To demonstrate the performance of our proposed method (Fed-Align-CL), we compare it against three non-federated strategies and two federated strategies. The non-federated strategies consist of: (i) training and testing within a single site (Single); (ii) training on one site and testing on another site (Cross); and (iii) collecting multi-site data together for training (Mix). The latter does not preserve data privacy, since this model requires access to all training images and their respective classification labels. In Cross, we denote the site used for training as 'tr'. Also, we ignore the performance of the site used for training in this row, and report it in the row 'Single'. The federated strategies consist of training a client-server-based FL method with: (iv) FedAvg [22], and (v) federated adversarial learning [2], [8]. We also performed an ablation study to verify the individual contributions of the domain alignment and the curriculum scheduling; therefore, we included Fed-CL in our comparison. Classification metrics for breast malignancy classification are reported in Table III. In the first row, we include the performance of Wu's model [42] without further training. We begin with the results of the non-federated methods. First, as expected, we find that the Cross models do not generalize well across manufacturers. Second, the individual models (Single) achieve an average AUC of 0.83 for the three sites.
When comparing our performance to other works on the publicly available INbreast dataset (Siemens), we achieve an AUC comparable to [42], but lower than [40], [46], which report an AUC of 0.95. However, the latter two works rely on region-wise ground truth: the first leverages ROI localization and the second uses patch pretraining. In contrast, our models only rely on the full mammograms and their corresponding classification labels. As expected, the best performing model is Mix, which is trained with mammography images and their corresponding annotations from all sites, thus not preserving privacy. We then compare the federated approaches. First, we find that the Fed-CL method improves on average the PR-AUC with respect to Fed. However, the performance of these two methods across the different domains can be uneven. The Fed-Align approach helps to learn domain-invariant features that are beneficial for the classification task. Finally, we can see that our proposed Fed-Align-CL achieves on average the highest AUC and PR-AUC. We also find a consistent improvement with our proposed method when the models are trained from scratch. However, the classification metrics were better with the pretrained weights [42], as discussed in the previous experiment. c) Alignment of features in latent space: In order to visualize the effect of the domain adaptation and curriculum scheduling techniques, we show in Fig. 4 the two-dimensional t-SNE [47] projection of the embedded latent space. First, we can see that the features learned by Fed are clustered according to the input domain, i.e., Fed learns domain-variant features. Second, the combination of FL with CL results in samples more spread along the manifold, although still dependent on the domain. In the plots that correspond to the models that perform DA, we can see that the input images are clustered according to the label rather than the domain. We find that the domain alignment is particularly helpful for Siemens. Fig. 5 in the Supplementary Material shows a similar behaviour for the penultimate classification layer. FL is a potential solution for the future of digital health [48], especially for classification tasks without access to sufficient data. FL allows for collaboratively training a model without sharing private data from different sites. A challenging aspect of sharing data within FL is the one related to legal regulations and ethics. Privacy and data protection need to be taken carefully into account. High-quality anonymization of mammography images and electronic records has to be guaranteed and, in certain regions, has to be GDPR (EU/UK General Data Protection Regulation) [49] or HIPAA (Health Insurance Portability and Accountability Act) [50] compliant. Privacy-preserving techniques for FL provide a trade-off between model performance and re-identification risk. However, remaining data elements may allow for patient re-identification [51]. Unless the anonymization process destroys the data fidelity, patient re-identification or information leakage cannot be ruled out. Another challenging aspect of FL is that of training a model mixing heterogeneous data, i.e., images obtained from different system vendors or acquisition protocols. In this work, we have investigated and confirmed the negative effect of domain shift on malignancy classification with multi-site mammograms. Models trained on single-vendor images did not generalize adequately to other vendors. The best performing single-vendor model was the one for GE. Interestingly, we found that our curriculum federated learning approach did not improve in this case.
This behaviour could be related to the GE dataset being more similar to the data used to obtain the pretrained weights of [42]. Moreover, due to the presence of domain shift between the datasets, the federated models that did not consider any domain adaptation performed worse than those that included domain alignment. We attribute this underperformance partially to the difference in the intensity profiles, and partially to the sizes of the datasets being insufficient for good generalization. In this work, we have investigated the use of CL to boost the alignment between domain pairs and improve the overall classification of breast cancer. In particular, our memory-aware curriculum is implemented with a data scheduler that arranges the order of the training samples. This order is defined with a scoring function that prioritizes training samples that have been forgotten after the deployment of the global model. We believe further research will follow on the use of CL in combination with FL and DA. We envision three approaches: those focused on prioritizing training (source) samples for better classification; those focused on smartly weighting the aggregation of the local models; and those focused on improving alignment between domain pairs. Similar to the work presented in this paper, other schemes can be designed to prioritize the (source) samples via a data scheduler, for instance, motivated by boosting [52]. Regarding the local model aggregation, one could deploy a CL-based adaptive weighting for clients based on a dynamic scoring function taking into account meta-information [21], and in this way help to cope with unbalanced and non-IID data. Finally, to improve alignment, scoring functions could rely on computing the distance between the (noisy) latent representations of the source and the remaining domains to weight each local model's contribution. In this work, we have designed and integrated a CL strategy in a federated adversarial learning setting for the classification of breast cancer. We have learned a collaborative decentralized model with three clinical datasets from different vendors. We have shown that, by monitoring the local and global classification predictions, we can schedule the training samples to boost the alignment between domain pairs and improve the classification performance.

Fig. 4: t-SNE visualization of the latent space obtained by Fed, Fed-CL, Fed-Align and Fed-Align-CL, in that order. The circles represent normal and benign samples, and the crosses malignant cases. Each color represents a domain.

a) Algorithm 1: Algorithm 1 presents the pseudo-code for our novel memory-aware curriculum federated learning method. b) Architecture of the models: We provide the detailed model architecture in Table IV. We denote convolutional layers as Conv, max pooling layers as MaxPool, fully connected layers as FC, batch normalization layers as BN, ReLU layers as ReLU, dropout layers as Dropout and sigmoid layers as Sigmoid. For FC layers, the values in brackets represent the input and output dimensions. For Conv layers, we provide in this order: the input and output feature maps, the kernel size, the stride and the padding.
For MaxPool layers, we provide in this order: kernel size, stride, padding and dilation. For dropout layers (Dropout), we provide the probability of an element being zeroed. To define the feature extractor, we use a Block consisting of a series of layers, for which we specify the convolutional parameters: the input and output feature maps, the kernel size, the stride of the first convolution $s_1$, the stride of the second convolution $s_2$, and the padding. In particular, a Block consists of: ReLU, BN, Conv(), BN and Conv(). c) Statistical significance: In Table V, we run a t-test between every pair of strategies to verify the significance of our results. We report the p-values for every pair of federated strategies. d) t-SNE feature visualization: Figure 5 depicts the first two components after applying t-SNE to the penultimate classification layer of every federated method.

References
[1] Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries
[2] Federated adversarial domain adaptation
[3] FedBN: Federated learning on non-IID features via local batch normalization
[4] Practical secure aggregation for privacy-preserving machine learning. ACM SIGSAC Conference on Computer and Communications Security
[5] Our data, ourselves: Privacy via distributed noise generation
[6] The algorithmic foundations of differential privacy
[7] Curriculum learning
[8] Multi-site f-MRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results
[9] Private federated learning with domain adaptation
[10] Federated learning for breast density classification: A real-world implementation
[11] InPrivate digging: Enabling tree-based distributed data mining with differential privacy
[12] ABY3: A mixed protocol framework for machine learning
[13] Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption
[14] Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation
[15] Privacy-preserving federated brain tumour segmentation
[16] FedDis: Disentangled federated learning for unsupervised brain pathology segmentation
[17] Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records
[18] Weight erosion: An update aggregation scheme for personalized collaborative machine learning
[19] Siloed federated learning for multi-centric histopathology datasets
[20] Contrastive cross-site learning with redesigned net for COVID-19 CT classification
[21] Inverse distance aggregation for federated learning with non-IID data
[22] Communication-efficient learning of deep networks from decentralized data
[23] Transfer feature learning with joint distribution adaptation
[24] Just DIAL: Domain alignment layers for unsupervised domain adaptation
[25] Domain-adversarial training of neural networks
[26] Learning transferable features with deep adaptation networks
[27] Unsupervised domain adaptation by backpropagation
[28] Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs
[29] Self-paced balance learning for clinical skin disease recognition
[30] Medical-based deep curriculum learning for improved fracture classification
[31] Curriculum learning for annotation-efficient medical image analysis: Scheduling data with prior knowledge and uncertainty
[32] Training medical image analysis systems like radiologists
[33] Towards recognizing unseen categories in unseen domains
[34] mixup: Beyond empirical risk minimization
[35] Transferable curriculum for weakly-supervised domain adaptation
[36] Curriculum manager for source selection in multi-source domain adaptation
[37] Algorithms and theory for multiple-source adaptation
[38] Unpaired image-to-image translation using cycle-consistent adversarial networks
[39] INbreast
[40] Deep learning to improve breast cancer detection on screening mammography
[41] BI-RADS update
[42] Deep neural networks improve radiologists' performance in breast cancer screening
[43] Deep residual learning for image recognition
[44] A curated mammography data set for use in computer-aided detection and diagnosis research
[45] ImageNet classification with deep convolutional neural networks
[46] Detecting and classifying lesions in mammograms with deep learning
[47] Visualizing data using t-SNE
[48] The future of digital health with federated learning
[49] Data, privacy, and the greater good
[50] HIPAA regulations - a new era of medical-record privacy
[51] Estimating the success of re-identifications in incomplete datasets using generative models
[52] A short introduction to boosting

Algorithm 1: Memory-aware Curriculum Federated Learning
input: N, number of sites; X = {X_1, ..., X_N}, mammograms; Y = {Y_1, ..., Y_N}, classification labels; target dataset; m_n, training size at site n; f_w = {f_{w_1}, ..., f_{w_N}}, local models; Z(·), noise generator for privacy preservation; Q, number of optimization iterations; τ, communication pace; E, optimization epochs; E_w, warm-up epochs
// Memory-aware curriculum
for k = 1 to m_i do
    Get the local ŷ
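Putting the pieces together, the following toy end-to-end round reuses the fed_avg and curriculum_permutation helpers sketched earlier; the synthetic data, logistic-regression clients and loop constants are our assumptions for illustration, not the authors' exact Algorithm 1 (domain alignment and the noise generator Z(·) are omitted for brevity).

import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, lr=0.1):
    # one logistic-regression gradient step on the locally scheduled data
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return w - lr * X.T @ (p - y) / len(y)

def predict(w, X):
    return (X @ w > 0).astype(int)

# Three toy non-IID sites standing in for the Hologic, GE and Siemens nodes.
sites = []
for shift in (0.0, 1.0, 2.0):
    X = rng.normal(shift, 1.0, size=(64, 5))
    y = (X.sum(axis=1) > 5 * shift).astype(int)
    sites.append({"X": X, "y": y, "w": np.zeros(5), "order": rng.permutation(64)})

for communication_round in range(20):
    for s in sites:  # local updates on the curriculum-ordered data {X_pi, Y_pi}
        Xp, yp = s["X"][s["order"]], s["y"][s["order"]]
        for _ in range(5):  # tau local iterations per round
            s["w"] = sgd_step(s["w"], Xp, yp)
        s["y_local"] = predict(s["w"], s["X"])  # y^L before aggregation
    w_global = fed_avg([s["w"] for s in sites],  # server-side FedAvg
                       [len(s["y"]) for s in sites])
    for s in sites:
        s["w"] = w_global.copy()  # deploy the global model
        y_global = predict(s["w"], s["X"])  # y^G after aggregation
        s["order"] = curriculum_permutation(s["y"], s["y_local"], y_global)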