key: cord-0192439-zk13em7d
authors: Rajotte, Jean-Francois; Mukherjee, Sumit; Robinson, Caleb; Ortiz, Anthony; West, Christopher; Ferres, Juan Lavista; Ng, Raymond T
title: Reducing bias and increasing utility by federated generative modeling of medical images using a centralized adversary
date: 2021-01-18
journal: nan
DOI: nan
sha: f421815d255b33b31d2fcfe631a7780d66e5a29c
doc_id: 192439
cord_uid: zk13em7d

We introduce FELICIA (FEderated LearnIng with a CentralIzed Adversary), a generative mechanism enabling collaborative learning. In particular, we show how a data owner with limited and biased data could benefit from other data owners while keeping data from all the sources private. This is a common scenario in medical image analysis, where privacy legislation prevents data from being shared outside local premises. FELICIA works for a large family of Generative Adversarial Network (GAN) architectures, including vanilla and conditional GANs as demonstrated in this work. We show that by using the FELICIA mechanism, a data owner with limited image samples can generate high-quality synthetic images with high utility while no data owner has to provide access to its data. The sharing happens solely through a central discriminator whose access is limited to synthetic data. Here, utility is defined as classification performance on a real test set. We demonstrate these benefits on several realistic healthcare scenarios using benchmark image datasets (MNIST, CIFAR-10) as well as on medical images for the task of skin lesion classification. With multiple experiments, we show that even in the worst cases, combining FELICIA with real data gracefully achieves performance on par with real data, while most results significantly improve the utility.

Learning from images to build diagnostic or prognostic models of a medical condition has become a very active research topic because of its great potential to provide better care for patients. Deep learning has been involved in much of modern progress in medical computer vision techniques, such as disease detection and classification as well as biomedical segmentation [8]. However, for such methods to capture the subtle patterns between a medical condition and an image, it is important that a model is exposed to a rich variety of cases. It is well known that images from a single source can be significantly biased by the demographics, equipment, and acquisition protocol [28]. Consequently, a model trained on images from a single source would have its predictive performance skewed towards the population from that source and could perform poorly for other populations. Ideally, such a model should be trained on images from as many sources as possible. To reduce the associated cost of collecting and labelling data, all sites such as hospitals and research centers would benefit from sharing their images.

Gaining access to large medical datasets requires a very lengthy approval process due to concerns about privacy breaches. Most current privacy legislation prevents datasets from being accessed and analyzed outside of a small number of dedicated servers (e.g., servers within a local hospital). However, to unleash the full power of various machine learning techniques, particularly deep learning methods, we need to find ways to share data among research groups, while satisfying privacy requirements.

How sharing actually happens can depend on many factors such as use cases, regulation, business value protection and infrastructure availability. In this work, we focus on synthetic data creation which allows multiple downstream use cases and exploration. Our objective is to show how different sites (e.g. hospitals) can help each other by creating joint or disjoint synthetic datasets that contain more utility than any of the single datasets alone. Moreover, the synthetic dataset can be used as a benchmark for machine learning in health care. To this end, we first test our method in two toy setups using common benchmark datasets, where we create artificial sites with datasets from different data distributions. This shows the potential of our method in the domain of medical imaging.

Sharing private data or their characteristics has been extensively explored recently. A common approach is to generate privacy-preserving synthetic data using variants of Generative Adversarial Networks (GANs) [11]. GANs are generative models that are able to create realistic-looking synthetic images. A GAN comprises a generator G and a discriminator D playing a two-player game. The generator aims to create fake samples that the discriminator rates as real with as high a probability as possible. The discriminator, on the other hand, tries to estimate the probability that a sample is real (rather than fake). PrivGAN [21] is an extension of GAN, originally designed to generate synthetic data while improving the privacy of the data used for training. Although PrivGAN was developed to be applied locally on a single dataset, previous work [24] has demonstrated that PrivGAN can be useful in a federated learning setting. In this paper, we develop a general mechanism (FELICIA) to extend a large family of GANs to a federated learning setting utilizing a centralized adversary. We explore the application of this framework to show how different sites can collaborate with each other to improve machine learning models in a privacy-preserving distributed data sharing scenario. To demonstrate the relevance of FELICIA, we focus on settings relevant to health care. However, other natural scenarios can be found in disparate domains such as banking.

Figure 1: FELICIA architecture with N=2 users. The real data subsets are determined by the users, to which we associate local components by subscripts. In this work, the users will often be referred to as sites or hospitals in our experiment scenarios.

Our main contributions are the following:

• Formalize a new federated learning mechanism (FELICIA) motivated by the PrivGAN ( [21] ) architecture, which extends to a family of GAN architectures.

• Demonstrate empirically that the hyperparameter λ can improve the utility, contrary to the original PrivGAN.

• Generalize the hyperparameter λ to be site-dependent.

• Improve the synthetic data by using generators from multiple epochs as was done in ( [4] ).

• Demonstrate the applicability of FELICIA on real non-IID data both for conditional and non-conditional synthetic data generation.

• Demonstrate the applicability of FELICIA to enable medical image sharing in a federated learning context.

• Demonstrate that FELICIA can create a synthetic dataset without the utility bias from its local data.

Sharing data between non-local sites such as hospitals and research centers can be achieved in many ways. A popular approach to share data with privacy is to generate private synthetic data with Differential Privacy ( [7] ). Generative models such as GANs based on either differentially private stochastic gradient descent ( [1, 29] ) or the Private Aggregation of Teacher Ensembles, PATE ( [22, 30] ) are of particular interest. Both approaches suffer from low utility data for a reasonable degree of privacy. Another approach is to train a model in a federated learning setting such that the data never has to be shared ( [26, 25, 12, 9] ). Since it has been demonstrated that GANs are vulnerable to privacy attacks ( [13] ), various approaches have been proposed to provide better privacy protection. Synthetic data from GANs trained on distributed datasets with differential privacy ( [10, 6] ) suffer from the same low quality as synthetic data from centrally trained GANs, unless they have access to a very large amount of training data as in this language model application ( [19] ). FELICIA allows users to create high quality local synthetic datasets while privacy protection naturally arises from the architecture.

PrivGAN ( [21] ) is an extension of a GAN originally designed to protect against membership inference attacks, such as LOGAN and MACE ( [13, 18] ). The architecture is comprised of N GANs trained on disjoint, independent and identically distributed (IID) subsets with an extra loss from a central (private) discriminator D P . The authors show that their method "minimally affects the quality of downstream samples as evidenced by the performance on downstream learning tasks such as classification".

The key feature here is that the only connection between the subsets is the central discriminator D P accessing only synthetic data.

While the original formulation of privGAN can be seen as a modification to the original GAN architecture [11], the mechanism of utilizing multiple generator-discriminator pairs and a centralized adversary is quite general. To that end, we first define a general family of GANs [3] that contain a single generator G, a single discriminator D and a loss governed by a measure function φ: [0,1] → R as follows:
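A sketch of this objective, assuming the standard form of the family defined in [3]; the symbols $p_{\mathrm{data}}$ and $p_z$ for the real-data and latent-noise distributions are notation assumed here:

```latex
V_\varphi(G, D) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\varphi\big(D(x)\big)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\varphi\big(1 - D(G(z))\big)\right]
```

With $\varphi(t) = \log t$, this reduces to the original GAN objective of [11], which G minimizes and D maximizes.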

In the case of conditional GANs, x is replaced by the conditioned tuple (x|y) where y is the label associated with sample x. Our proposed mechanism (FELICIA) extends any GAN belonging to this family to a federated learning setting using a centralized adversary. Formally, given a measure function φ and corresponding GAN loss V φ , the federated loss is:
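A sketch of this loss, assuming it mirrors the privGAN objective [21] with the logarithm replaced by φ and the single λ replaced by per-user weights λ_i. The notation $D_P^{\,i}$ (the probability that the central discriminator assigns to a sample having been produced by user i's generator) and $p_z$ (the latent-noise distribution) are assumptions of this sketch:

```latex
\min_{\{G_i\}} \; \max_{\{D_i\}} \; \max_{D_P} \;
\sum_{i=1}^{N} \Big[
  \underbrace{V_\varphi(G_i, D_i)}_{\text{local}}
  + \lambda_i \,
  \underbrace{\mathbb{E}_{z \sim p_z}\!\left[\varphi\big(D_P^{\,i}(G_i(z))\big)\right]}_{\text{global}}
\Big]
```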

A notable novelty here is that λ is now an N-dimensional parameter λ=(λ 1 ,...,λ N ), with one component for each of the N users. Contrary to PrivGAN, both terms in FELICIA's loss have the potential to contribute to utility: the local term favors utility on local data and the global term favors utility on all users' data.
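To make the role of the site-dependent λ concrete, the following is a minimal sketch of how such a loss could be estimated from batches of discriminator outputs. It assumes a privGAN-style form (a local GAN value per user plus λ_i times a central-discriminator term evaluated on that user's synthetic samples); the function names, argument layout and the default φ = log are illustrative assumptions, not the authors' implementation.

```python
import math

def gan_value(d_real, d_fake, phi=math.log):
    """Monte Carlo estimate of V_phi(G, D) from discriminator outputs in (0, 1)."""
    real_term = sum(phi(p) for p in d_real) / len(d_real)
    fake_term = sum(phi(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def federated_value(d_real, d_fake, dp_own, lambdas, phi=math.log):
    """Sum over users of (local term + lambda_i * global term).

    d_real[i]: outputs of D_i on user i's real batch
    d_fake[i]: outputs of D_i on G_i's synthetic batch
    dp_own[i]: probability the central discriminator assigns to
               "produced by user i" on G_i's synthetic batch (assumed form)
    lambdas[i]: site-dependent weight of the global term
    """
    total = 0.0
    for i, lam in enumerate(lambdas):
        local_term = gan_value(d_real[i], d_fake[i], phi)
        global_term = sum(phi(p) for p in dp_own[i]) / len(dp_own[i])
        total += local_term + lam * global_term
    return total
```

Setting every λ_i to zero recovers N independent GAN objectives, which is one way to see that the central discriminator is the only coupling between users.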

In this paper, we apply our mechanism to three separate GANs belonging to this family: i) the original GAN ( [11] ), ii) DCGAN ( [23] ), and iii) conditional GAN ( [20] ). We note however, that these are simply representative examples and the mechanism applies to a wide variety of GANs such as WGAN ( [2] ), DP-GAN ( [29] ), etc.

To implement the FELICIA mechanism we follow a process similar to the original PrivGAN paper. Specifically, we duplicate the discriminator and generator architectures of a 'base GAN' to each of the component generator-discriminator pairs of FELICIA. The privacy discriminator (D P ) is selected to be identical in architecture to the other discriminators barring the activation of the final layer. Most of the optimization effort is dedicated to train the base GAN on the whole training data to generate realistic images. Then FELICIA's implementation is optimized with the base GAN's parameters which are tuned to get good looking samples. This last step is usually much faster as the base GAN's parameters represent a good starting point.

Our experiments are based on a simulation of two hospitals (Hospital 1 and Hospital 2) with different patient populations. We consider a regulation preventing the sharing of images as well as of models that had access to images. We will use FELICIA where Hospital 1 and Hospital 2 correspond respectively to User 1 and User 2 in Figure 1. For our last two experiments, we define the concept of helpee and helper. The helpee is a hospital with a low-utility, biased dataset and the helper is a hospital with a rich, high-utility dataset willing to help within the above regulation restrictions. We will show that, through the FELICIA framework, the helpee, Hospital 1, can locally generate a less biased synthetic dataset with more utility than its own (real) data.

First, we use the MNIST dataset ( [15] ) to show how FELICIA can help generate synthetic data with better coverage of the input distribution, even when both sites have a biased coverage of the possible input space. Second, we use a more complex dataset, CIFAR-10, to show how the utility could be significantly improved when a subgroup is underrepresented in the data. Finally, we test FELICIA in a federated learning setting with medical imagery using a skin lesion image dataset. In the first experiment, the utility is demonstrated visually by showing the distribution of the generated samples. In the other experiments, the utility is defined as the performance of a classifier trained on synthetic data (sometimes combined with real data) and evaluated on a held out real dataset.

In the first two experiments, we have kept the default parameters of the original PrivGAN implementation, namely equal λ's (i.e. λ 1 =λ 2 =1). We have also used the generator at the end of the training phase, which gave satisfactory results. In our last experiment, however, this implementation did not lead to synthetic data with satisfactory utility. We hypothesized that user-dependent λ i 's would better suit this scenario, with the helpee being penalized more when its synthetic data is distinguishable from the helper's synthetic data. Conversely, the helper, which can generate good quality synthetic data on its own, does not need to be penalized as much when its synthetic data is distinguishable from that of the biased helpee. Also, we have noticed that the utility does not increase asymptotically with increasing epoch. Inspired by previous work [4], we used synthetic images from 5 generators from the top epochs in utility. The selection of the top epochs, as well as the best combination of (λ 1 ,λ 2 ), was determined with the held-out validation set, and the final utility was determined on the test set.
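The multi-epoch selection can be sketched as follows. This is an illustration under assumptions: `val_utility` is a hypothetical mapping from a saved epoch to its validation utility, and `generators` is a hypothetical mapping from an epoch to a callable that returns a list of synthetic samples.

```python
def top_k_epochs(val_utility, k=5):
    """Return the k epochs with the highest validation utility, best first."""
    ranked = sorted(val_utility.items(), key=lambda item: item[1], reverse=True)
    return [epoch for epoch, _ in ranked[:k]]

def pooled_synthetic_set(generators, epochs, n_per_generator):
    """Pool samples drawn from the generator checkpoints saved at `epochs`."""
    samples = []
    for epoch in epochs:
        samples.extend(generators[epoch](n_per_generator))
    return samples
```

Pooling across checkpoints trades a single best generator for a more diverse synthetic set; the final utility is still measured on the separate test set.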

One setting that multiple sites may observe is when Hospital 1 owns a dataset with samples from one part of the input distribution, while Hospital 2 has a dataset with samples from a different part. We simulate this setting using the 28x28 gray scale hand written digit dataset MNIST ( [15] ). Specifically, we test whether FELICIA is able to generate representative samples from the entire input distribution while the local data is biased.

Given all images of a selected digit, we perform PCA and cluster the images in the resulting embedding space using K-means (k = 2). The resulting clusters will be used to distribute the images to the sites.
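The clustering step can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' code: PCA is computed through an SVD, K-means is a plain Lloyd iteration, and the function names, the two-component embedding and the iteration count are assumptions.

```python
import numpy as np

def pca_embed(images, n_components=2):
    """Project flattened images onto their top principal components."""
    x = images.reshape(len(images), -1).astype(float)
    x_centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T

def kmeans_two_clusters(points, n_iter=50, seed=0):
    """Lloyd's algorithm with k = 2; returns a 0/1 label for each point."""
    rng = np.random.default_rng(seed)
    # Initialize the two centers on two distinct data points.
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels
```

The two label groups then play the role of cluster 1 and cluster 2 when images are distributed to the sites.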

We then train FELICIA using a varying proportion of images from both clusters and compare the resulting generated images to the original images. We also compare with images generated by traditional GANs trained only on data from cluster 1 and cluster 2. Specifically, we define a mixing parameter, α, used to select the number of samples from each cluster used to fit FELICIA and two simple GANs. FELICIA will be trained on the two subsets defined as follows:

Subset 1 A random selection of α% of samples from cluster 1, and a fraction (100−α)% of samples from cluster 2.

Subset 2 Same as Subset 1 but inverting the fraction, i.e. replacing α% by (100−α)%.

Subset 1 and Subset 2 correspond respectively to X 1 Real and X 2 Real in the right diagram of Figure 1 .
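The construction of the two subsets from the mixing parameter α can be sketched as below. The function name is hypothetical, and the sketch assumes the two subsets are disjoint, so that Subset 2 receives exactly the samples Subset 1 does not.

```python
import random

def make_mixed_subsets(cluster1, cluster2, alpha, seed=0):
    """Split two clusters into two site datasets; alpha is given in percent.

    Subset 1 gets alpha% of cluster 1 and (100 - alpha)% of cluster 2.
    Subset 2 gets the complementary samples, i.e. (100 - alpha)% of
    cluster 1 and alpha% of cluster 2.
    """
    rng = random.Random(seed)
    c1, c2 = list(cluster1), list(cluster2)
    rng.shuffle(c1)
    rng.shuffle(c2)
    n1 = round(len(c1) * alpha / 100)
    n2 = round(len(c2) * (100 - alpha) / 100)
    subset1 = c1[:n1] + c2[:n2]
    subset2 = c1[n1:] + c2[n2:]
    return subset1, subset2
```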

For α=0, Subset 1 will be completely biased towards cluster 2 (representing a specific section of the input distribution), and for α=50 both subsets will consist of equal numbers of samples spread over the input distribution. FELICIA's training results in a generator for Subset 1, G 1 , and a generator for Subset 2, G 2 . We also train two simple GANs using the data from Subset 1 and Subset 2 respectively.

Once all GANs are trained, we generate 2000 samples from each and compare them to the original samples by plotting all samples using the first two principal components from the original image embedding step. Figure 2 shows such plots for three values of α. When the bias is maximal (i.e. when α=0), FELICIA generates images only at the cluster border, while the simple GANs generate images only from the cluster on which they were trained. This is not surprising: if the local discriminator D 1 is never trained on real images from a given cluster, it will not "allow" the generator to cover that part of the input space, so the only generated samples that satisfy both discriminators are those at the border. As α increases (shown in descending rows of Figure 2), it is clear that the samples generated by FELICIA cover more of the input space than those of the local GANs.

Another setting that various sites may observe is when one owns an imbalanced dataset while the other owns a complete (unbiased) dataset. In this setting, the owner of the imbalanced dataset, the helpee, should be able to benefit from the owner of the balanced dataset, the helper. We use the CIFAR-10 dataset [14] to simulate such a setting. CIFAR-10 is a dataset of 32x32 RGB images labeled with 10 different classes of animals and transport vehicles. To represent a biased dataset, we define two classes: class 1, the house pets class, consisting of "cats" and "dogs", and class 2, the large animal class, consisting of "deer" and "horses". Similarly to the previous experiment, we will create two subsets:

Subset 1 Contains an equal number of cat and dog samples for class 1 and an unequal number of deer and horse samples for class 2. The bias of this subset will be quantified with β, the fraction of class 2 samples that are images of deer. This represents the helpee's dataset.

Subset 2 Contains an equal number of cat and dog samples for class 1 and an equal number of deer and horse samples for class 2. This represents the helper's dataset.

Note that the two subsets have an equal number of images; the difference is in the proportion of deer and horses among the samples that make up class 2.
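A minimal sketch of how a subset with bias β could be assembled is given below. The function name and arguments are hypothetical; the helper's balanced subset corresponds to β = 0.5 in this sketch.

```python
import random

def make_two_class_subset(cats, dogs, deer, horses, n_per_class, beta, seed=0):
    """Build a labeled subset with n_per_class samples in each class.

    Class 1 (house pets) is split equally between cats and dogs.
    Class 2 (large animals) holds a fraction beta of deer; the rest are horses.
    """
    rng = random.Random(seed)
    n_cats = n_per_class // 2
    n_deer = round(n_per_class * beta)
    class1 = rng.sample(cats, n_cats) + rng.sample(dogs, n_per_class - n_cats)
    class2 = rng.sample(deer, n_deer) + rng.sample(horses, n_per_class - n_deer)
    return [(x, 1) for x in class1] + [(x, 2) for x in class2]
```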

We train a CNN to discriminate between class 1 and class 2 with three different training sets: Subset 1 only, Subset 1 + GAN synthetic data (i.e. augmented with GAN), and Subset 1 + FELICIA synthetic data (i.e. augmented with FELICIA), then measure the classification accuracy on a held-out test set. FELICIA synthetic data is created from the helpee's generator associated with Subset 1. Figure 3 shows the accuracy as a function of β of each classification model evaluated on the full held-out test set and on its deer images only. We observe that the classifier accuracy after training on real data decreases when the data is more biased towards the deer. This is expected as the test data is balanced and the reduced subgroup in the training set leads to reduced accuracy. This is confirmed in the right panels of Figure 3, showing a decrease in accuracy of the classifier consistent with the bias of the training data. The same figure shows that augmenting the classifier training set with synthetic data from a simple GAN does not improve the accuracy. This is also expected, as the goal of a simple GAN is to reproduce the training data distribution. Finally, the classifier trained on real data augmented with FELICIA synthetic data is systematically better than the other classifiers. The improvement is particularly significant when the data is most biased.
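The two evaluations reported here (full test set and deer images only) amount to computing accuracy with and without a subgroup mask. A small sketch, with a hypothetical function name and a boolean mask marking the subgroup:

```python
def accuracy(y_true, y_pred, subgroup_mask=None):
    """Fraction of correct predictions, optionally restricted to a subgroup."""
    if subgroup_mask is None:
        subgroup_mask = [True] * len(y_true)
    kept = [(t, p) for t, p, keep in zip(y_true, y_pred, subgroup_mask) if keep]
    if not kept:
        raise ValueError("no samples selected")
    return sum(t == p for t, p in kept) / len(kept)
```

A classifier can score well overall while failing on the underrepresented subgroup, which is why both numbers are worth reporting.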

In our last experiment, we apply FELICIA in a federated learning setting with a real-world medical image dataset, HAM10000 [27]. Similar to the previous experiment, we will use these images to simulate a biased subset for the helpee, Hospital 1. This dataset contains a large collection of multi-source dermatoscopic images of common pigmented skin lesions. These are separated into 7 imbalanced sets of skin lesion images, from which we use the four most populated: lesion 0, Melanocytic nevi (6705 images); lesion 1, Melanoma (1113 images); lesion 2, Benign keratosis (1099 images); and lesion 3, Basal cell carcinoma (514 images). Lesions 0 and 2 are benign whereas lesions 1 and 3 are associated with different types of skin cancer. We create two classes from these lesion sets:

Class 0 Images of benign lesions from lesion 0 & lesion 2

Class 1 Images of cancerous lesions from lesion 1 & lesion 3

We evaluate the performance of a binary classifier trained to predict whether a lesion is benign (class 0) or cancerous (class 1). This type of skin lesion classification has been shown to be successful with deep learning (see [8] and references therein).

We first randomly remove 1000 images from each class to create two equal-size held-out sets: a test set and a validation set. From the remaining dataset, the first subset is defined similarly to the previous experiment, with balanced classes but artificially biased in the lesion types within one of the classes. The training subsets are defined as follows:

Subset 1 Contains 300 images from each class.

• Class 0 (benign) biased with 10 images of lesion 0 and 290 images of lesion 2.

• Class 1 (cancerous) balanced with 150 images of lesion 1 and 150 images of lesion 3.

Subset 2 Contains the remainder of the dataset.

Subset 1 and Subset 2 correspond respectively to X 1 Real and X 2 Real in the FELICIA diagram in Figure 1.

We explore how the helpee (Subset 1), with limited and biased data, could be helped by the helper's (Subset 2) richer data through the FELICIA mechanism. In this experiment, the utility is defined as the performance of a benign/cancerous classifier trained on synthetic data and evaluated on the held-out set. Specifically, we use the area under the receiver operating characteristic curve (AUC-ROC). Since we are dealing with two hyperparameters (λ 1 ,λ 2 ), they are selected with the validation set and the final performance is reported on the test set. To address the relatively low number of images (compared to the previous experiments), we use a conditional GAN [20] to leverage all training images for one model, as opposed to a model per class as in the previous experiments.
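For readers unfamiliar with the metric, AUC-ROC equals the probability that a randomly chosen positive (cancerous) example receives a higher score than a randomly chosen negative (benign) one, with ties counted as one half. A dependency-free sketch of that definition (the function name is illustrative; in practice a library routine would normally be used):

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

For example, `auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, since three of the four positive/negative pairs are ranked correctly.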

We train FELICIA on the two subsets over 100000 epochs for various combinations of (λ 1 ,λ 2 ) from equation (2) . For each set of (λ 1 ,λ 2 ), we re-run the experiment multiple times while varying the random seeds for the network initialization and data shuffling for the train, test and validation set.
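The search over (λ 1 ,λ 2 ) with repeated seeds can be sketched as a small grid search. Everything named here is hypothetical: `train_and_score` stands for a full training run that returns a (validation utility, test utility) pair, and aggregating runs by their median is an assumption of this sketch.

```python
from itertools import product
from statistics import median

def select_lambdas(lambda_grid, seeds, train_and_score):
    """Pick the (lambda_1, lambda_2) pair with the best median validation utility.

    Returns the selected pair and its median test utility. The test utility
    is only read after the pair has been chosen on validation data.
    """
    results = {}
    for lam1, lam2 in product(lambda_grid, repeat=2):
        runs = [train_and_score(lam1, lam2, seed) for seed in seeds]
        val = median(v for v, _ in runs)
        test = median(t for _, t in runs)
        results[(lam1, lam2)] = (val, test)
    best = max(results, key=lambda pair: results[pair][0])
    return best, results[best][1]
```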

For comparison, we train a conditional GAN using data only from Subset 1. This represents the synthetic data that the helpee could generate without access to the helper's dataset.

We evaluate the utility of the generated images every 50 training epochs. For each evaluated epoch, we use the generator to create 200 images for each class. A simple CNN classifier trained on these generated images is then evaluated on the 500 images of the balanced validation set. Then we select the best combination of (λ 1 ,λ 2 ) and make our final evaluation on a held-out test set. Figure 4 shows how the helpee can generate very realistic images with FELICIA, while images from the simple GAN are of very low quality and lack diversity. The images are produced from the generator saved at the epoch leading to the best utility. We show in Figure 5 the utility as a function of several combinations of (λ 1 ,λ 2 ). For simple (conditional) PrivGAN, we limit the search to λ 1 =λ 2 and a single generator as in the original paper [21]. We see that PrivGAN improves the overall utility less, and only for a limited range of λ. Table 1 summarizes the utility metrics for the synthetic data generated by the helpee. FELICIA data surpasses in utility both the local real data (Subset 1) and the synthetic data from a simple (conditional) GAN. Furthermore, Table 1 shows how FELICIA improves the accuracy of the utility classifier for the penalized subgroup (Melanocytic nevi) without significantly affecting the other subgroups.

Our experiments suggest that FELICIA allows generators to learn distributions beyond a local subset. This is supported by the cluster coverage in the first experiment on the handwritten digits as well as the improved utility of a synthetic dataset compared to the local data. Moreover, these results could not be reproduced with synthetic data produced by a GAN on the same local data. We note that in our last experiment the λ i hyperparameters are selected based on the evaluation on a holdout dataset. In an application, this data could be available as a public dataset. When this is not possible, the utility could be determined by training a classifier on synthetic data and evaluating it securely on the helper's data with differential privacy, e.g. [17], [5]. Alternatively, one might be interested in using the λ's to weight the clients' contributions for diversity rather than overall performance. In that case, they could be equal unless some sites hold very small datasets, whose λ's would then have to be reduced (presumably) in proportion to the dataset size.

Table 1: Utility of skin lesion synthetic images generated by the helpee. Melanocytic nevi is the biased subgroup (equivalent to the deer in the previous section). We note that FELICIA improves the overall score while keeping the performance on subgroups more balanced.

We have developed a novel mechanism, FELICIA, that allows for the sharing of data more securely in order to generate synthetic data in a federated learning context. By setting up various scenarios with biased sites, we have demonstrated the advantages of our mechanism with image datasets. We have shown that a biased site can be securely helped by another site through the FELICIA architecture and that it will benefit more the more biased it is. We have also demonstrated on medical images that FELICIA can help generate synthetic images with more utility than images available locally. The work presented was implemented centrally, therefore the performance effect of the sites being distributed is still to be investigated.

FELICIA can be implemented with a wide variety of GANs, the choice of which will depend on the type of data and use case. A particularly relevant use case is a pandemic such as COVID-19, where hospitals and research centers at the beginning of an outbreak would benefit from the data gathered by sites affected earlier. The data sharing approval process can easily take months, whereas pandemic microbiology evolution tells us that a virus can mutate to a different strain orders of magnitude faster. Another application is the augmentation of an image dataset to improve diagnostics, such as the classification of cancer pathology images [16]. The data from one research center is often biased towards some dominating population of the available data for training. FELICIA could help mitigate such bias by allowing sites from all over the world to create a synthetic dataset based on a more general population.

We are currently working on implementing FELICIA with progressive GAN in order to generate highly complex medical images such as CT scans, x-rays and histopathology slides in a real federated learning setting with non-local sites.

We show here more details related to Table 1. Figures 6, 7, 9, 8 and 10 correspond to rows 1-5 of Table 1. More precisely, the middle lines in the box plots (the medians) correspond to the values of Table 1. We note that not only does FELICIA have the best overall performance, but it also has the smallest performance scatter on the subgroups.

Figure 6: Repeated AUC-ROC performance evaluation on the Real dataset (helpee with different seeds for the classifier and shuffling) and the best synthetic lesion images from GAN, FELICIA (with PrivGAN parameters λ 1 =λ 2 ) and FELICIA. The values in brackets correspond to the selected (λ 1 , λ 2 ). Each Real point corresponds to training a classifier initialized with a different random seed on a training set selected with a different shuffling seed. Each synthetic point (GAN or FELICIA) corresponds to a different random seed for the data generation and the shuffling for creating the (real) subsets. The box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data.

[1] Deep learning with differential privacy

[2] Wasserstein generative adversarial networks

[3] Generalization and equilibrium in generative adversarial nets (GANs)

[4] Privacy-preserving generative deep neural networks support clinical data sharing

[5] Differential privacy for classifier evaluation

[6] GS-WGAN: A gradient-sanitized approach for learning differentially private generators

[7] The algorithmic foundations of differential privacy

[8] Deep learning-enabled medical computer vision

[9] Federated generative adversarial learning

[10] Differentially private federated learning: A client level perspective

[11] Generative adversarial networks

[12] MD-GAN: Multidiscriminator generative adversarial networks for distributed datasets

[13] LOGAN: Membership inference attacks against generative models

[14] Learning multiple layers of features from tiny images

[15] MNIST handwritten digit database

[16] Synthesis of diagnostic quality cancer pathology images by generative adversarial networks

[17] Private selection from private candidates

[18] MACE: A Flexible Framework for Membership Privacy Estimation in Generative Models

[19] Learning differentially private recurrent language models

[20] Conditional generative adversarial nets

[21] privGAN: Protecting GANs from membership inference attacks at low cost to utility

[22] Semi-supervised knowledge transfer for deep learning from private training data

[23] Unsupervised representation learning with deep convolutional generative adversarial networks

[24] Private data sharing between decentralized users through the privGAN architecture

[25] FedGAN: Federated generative adversarial networks for distributed data

[26] The future of digital health with federated learning

[27] The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

[28] Detect and correct bias in multi-site neuroimaging datasets

[29] Differentially private generative adversarial network

[30] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees

We are grateful to the Cascadia Data Discovery Initiative for enabling this collaboration and for granting Azure credits for part of this work.