title: Comparison of Privacy-Preserving Distributed Deep Learning Methods in Healthcare
authors: Gawali, Manish; ArvindC, S; Suryavanshi, Shriya; Madaan, Harshit; Gaikwad, Ashrika; BhanuPrakash, KN; Kulkarni, Viraj; Pant, Aniruddha
date: 2020-12-23

In this paper, we compare three privacy-preserving distributed learning techniques: federated learning, split learning, and SplitFed. We use these techniques to develop binary classification models for detecting tuberculosis from chest X-rays and compare them in terms of classification performance, communication and computational costs, and training time. We propose a novel distributed learning architecture called SplitFedv3, which performs better than split learning and SplitFedv2 in our experiments. We also propose alternate mini-batch training, a new training technique for split learning that performs better than alternate client training, in which clients take turns training a model.

Labeled data is scarce in the healthcare domain, and even when it is available, it is usually distributed across institutions and would have to be aggregated at a centralized storage site before deep learning models could be trained on it. However, most healthcare centers, as well as country-level regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), are rightfully protective of this data and do not allow it to be shared freely across computer networks and national boundaries. Distributed learning methods solve this problem by enabling models to train on data from multiple healthcare centers without compromising the privacy of the data held at each center.

Federated learning (FL) [1] [2] [3] is a distributed learning method that trains neural network models across multiple devices or servers without moving the data. This is in contrast to centralized training, where data samples from all sources must be collected at a single processing site. In FL, multiple federated rounds are performed to obtain a robust model. The workflow for one federated round (Figure 1) consists of the following steps: (i) the main server pushes the global model to the clients (healthcare centers); (ii) each healthcare center trains the model on its own server and sends its local update to the main server; (iii) the main server aggregates the updates received from the centers and updates the global model using the federated averaging algorithm. The resulting global model is robust because it has learned from a large and diverse set of data [4], and each healthcare center benefits because it can use a model that has also learned from other centers' data.

Figure 1: Federated Learning. n hospitals collaborate to train a global model. The global model W, which resides on the server, is transmitted to all hospitals, and each hospital trains the model on its own data. The server collects the updates ∆Wi from all hospitals and aggregates them into a single global update, which is used to adjust the global model W.
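As a concrete illustration of step (iii), the following is a minimal Python/PyTorch sketch of a federated-averaging aggregation step, assuming the sample-count weighting of the federated averaging algorithm [3]; the function name fedavg and the weighting scheme are illustrative assumptions, not the exact implementation used in this paper.

import torch
from typing import Dict, List

def fedavg(client_states: List[Dict[str, torch.Tensor]],
           client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted average of client model state_dicts, weighted by local dataset size."""
    total = float(sum(client_sizes))
    averaged = {}
    for key, ref in client_states[0].items():
        weighted = torch.stack([state[key].float() * (n / total)
                                for state, n in zip(client_states, client_sizes)])
        averaged[key] = weighted.sum(dim=0).to(ref.dtype)
    return averaged

# One federated round (sketch): the server pushes the global weights, each hospital
# trains locally and returns its state_dict, and the server aggregates the results.
# A per-client training helper such as local_train(model_state, data) is hypothetical here.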
Split learning (SL) [5] trains a machine learning model across multiple hosts by splitting the model into multiple segments. In the simplest split learning configuration, called the label-sharing configuration, the labels are held by the server, and each client (healthcare center) performs forward propagation up to a particular layer called the cut layer [5], as shown in Figure 2. The outputs at the cut layer are sent to the server, which carries forward propagation through the rest of the network to generate predictions. The training loss is computed at the server from the labels and predictions, and a backpropagation step is performed on the portion of the network present at the server, i.e., up to the cut layer. The gradients at the cut layer are then sent back to the client so that backpropagation can be completed on the first segment of the model. This process is repeated many times to obtain the final model. The server never accesses the raw local client data during training, which preserves the privacy of the client.

Figure 2: Split Learning. The neural network architecture is segmented into two parts. The first segment resides at the hospital (client), and the second segment (common across all hospitals) resides on the server. n hospitals collaborate to train one global server-side model and n client-side models.
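The following is a minimal sketch of one such training step in the label-sharing configuration, written in plain PyTorch rather than the PySyft setup used later in the paper; the toy network shapes, the helper name split_step, and the optimizer choices are illustrative assumptions.

import torch
import torch.nn as nn

def split_step(client_net, server_net, opt_c, opt_s, x, y,
               loss_fn=nn.BCEWithLogitsLoss()):
    # Client: forward up to the cut layer, then hand the activations to the server.
    activations = client_net(x)
    acts_sent = activations.detach().requires_grad_()
    # Server: finish the forward pass, compute the loss with its labels, backprop its segment.
    preds = server_net(acts_sent)
    loss = loss_fn(preds, y)
    opt_c.zero_grad(); opt_s.zero_grad()
    loss.backward()
    opt_s.step()
    # Server returns the cut-layer gradients; the client completes backpropagation.
    activations.backward(acts_sent.grad)
    opt_c.step()
    return loss.item()

# Toy example: a tiny client/server split on 224x224 single-channel images.
client_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
server_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
opt_c = torch.optim.Adam(client_net.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(server_net.parameters(), lr=1e-4)
x, y = torch.randn(4, 1, 224, 224), torch.randint(0, 2, (4, 1)).float()
split_step(client_net, server_net, opt_c, opt_s, x, y)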
SplitFed learning (SFL) is a newer decentralized machine learning methodology proposed by Thapa et al. [6] that combines the strengths of FL and SL. In its simplest configuration, again called the label-sharing configuration, the entire neural network architecture is split into two parts. Instead of training the client networks sequentially, Thapa et al. proposed training them in parallel, a property drawn from FL. There are two variants of SplitFed: SplitFedv1 (SFLv1) and SplitFedv2 (SFLv2). In SFLv1, the clients perform forward propagation in parallel on their respective data and send the activations obtained at the cut layer to the main server. The main server performs forward propagation on the server-side network for all client activations in parallel, then performs backpropagation and sends the resulting gradients back to the respective clients. At this point, the main server updates the server-side network using a weighted average of the gradients obtained during backpropagation. The clients complete their backpropagation steps using the gradients received from the server and send their updates to the fed server, as shown in Figure 3. The fed server averages the updates received from all clients and sends a single aggregate update back to all of them, which the clients use to update their models. The client-side and server-side networks are therefore kept synchronized. In SFLv2, the server-side network is trained sequentially: clients perform forward propagation and backpropagation one after another, and the client networks are synchronized at the end of each epoch by averaging all client updates at the fed server.

Figure 3: SplitFed. The neural network architecture is segmented into two parts: a client-side model and a server-side model. n hospitals collaborate to train either a global server-side model or n server-side models, and either a global client-side model or n client-side models, depending on the variant. The fed server averages the updates from the client-side models, and the main server averages the updates for the server-side models.

Sheller et al. [7] were the first to apply federated learning in the medical domain. They demonstrated that U-Net models trained on the BraTS dataset using federated learning achieved dice scores similar to models trained with the traditional centralized method. Li et al. [8] applied the concept of differential privacy to federated learning, using a segmentation model on the BraTS dataset to show that incorporating differential privacy slows the convergence of the FL model. Gupta et al. [9] introduced split learning and applied the U-shaped split configuration in the medical domain. They compared SL with two alternatives, a centrally hosted configuration and a non-collaborative configuration, on two problems: binary classification (fundus images) and multi-class classification (chest X-rays). As the number of clients increased, the performance of split learning remained stable, whereas the performance of the non-collaborative technique declined steadily. Liu et al. [10] used the federated learning framework with different deep learning architectures to detect COVID-19 from chest X-rays. Roth et al. [4] demonstrated that models trained on mammography data from multiple sources using federated learning perform better than standalone models trained on data from a single source.

Prior works have compared distributed learning methods with centralized training, but not with other distributed learning methods, for applications in the medical domain. In this comparative study, we evaluate the cost, in terms of classification performance, training time, communication, and computational costs, of using distributed learning in practice. We further contribute to this field by introducing a novel distributed learning architecture called SplitFedv3 (SFLv3) and a new training method called alternate mini-batch training, and we compare these with existing distributed learning techniques and training methods.

We obtained chest X-ray scans from five different sources. Three of these were private datasets, which we refer to as DT1, DT2, and DT3. The remaining two were the publicly available research datasets MIMIC [11], referred to as DT4, and Padchest [12], referred to as DT5. A team of board-certified radiologists manually annotated these X-ray images using a custom-built annotation tool. X-rays showing indications of infiltrates, nodular shadows, cavitation, breakdown, lymph nodes, pleural effusion, bronchiectasis, fibrosis, scar, granuloma, nodule, pleural thickening, calcification, calcified lymph nodes, or calcified pleural plaques were labelled TB-suspect; images without these indications were labelled TB-negative. In addition to these labels, the radiologists also drew polygon masks around the regions of interest in which these manifestations were observed. Table 1 describes the dataset distribution and the number of training, validation, and test images taken from each source. For each data source, the percentage of images belonging to the TB-suspect class (the prevalence) is 50% in the training set and 10% in the validation and test sets. Two image resolutions were used: 224x224 for the DenseNet architecture and 768x768 for the U-Net architecture.

The experimental network topology consists of one server and five clients, where each client holds the data from a single source. The clients are virtual workers, i.e., they reside on the same machine as the server. We chose this topology because it is close to the practical setting in which hospitals (clients) are likely to hold non-IID data. All of our experiments were implemented using PySyft [13].
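A minimal sketch of this topology is given below. It assumes the older PySyft 0.2.x API (sy.TorchHook, sy.VirtualWorker, tensor.send), which may differ from the exact PySyft version the authors used; the worker names simply mirror the DT1 to DT5 sources, and the helper is hypothetical.

import torch
import syft as sy

# Five virtual-worker clients hosted alongside the server process, one per data source.
hook = sy.TorchHook(torch)
clients = {name: sy.VirtualWorker(hook, id=name)
           for name in ["DT1", "DT2", "DT3", "DT4", "DT5"]}

def send_to_client(name, images, labels):
    """Keep each source's (non-IID) X-rays and labels on its own virtual worker."""
    return images.send(clients[name]), labels.send(clients[name])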
We performed two sets of classification experiments, varying the model architecture. For the first set, we used the DenseNet-121 architecture [14]. For the second set, we used the U-Net architecture [15] with an Xception backbone. The U-Net architecture is traditionally used for segmentation, but we used it for classification by deriving a probabilistic output from the segmentation output. For both sets of experiments, we used binary cross-entropy as the loss function and the Adam optimizer [16] with standard parameters (β1 = 0.9, β2 = 0.999) and a learning rate of 10^-4. The batch size was 64 in the DenseNet experiments and 4 in the U-Net experiments. DenseNet models were trained for 10 epochs and U-Net models for 5 epochs, since the models converge within these numbers of epochs. We saved the model with the lowest validation loss and evaluated it on the test set. The machine used for the experiments had 8 GB of RAM, Ubuntu 18.04, and a Tesla T4 16 GB GPU.

For the federated learning models, we used the federated averaging algorithm [3] to update the global neural network model at the end of each federated round (epoch). We do not address differential privacy in these experiments.

We experimented with two split learning configurations: the vanilla split learning, or label-sharing (LS), configuration and the U-shaped split learning, or non-label-sharing (NLS), configuration, as shown in Figure 4. In the LS configuration, the input images remain with the clients and the labels reside on the server, whereas in the NLS configuration both the input images and the labels remain with the clients.

Figure 4: Split Learning Configurations. In the vanilla (label-sharing) configuration, the client side holds the raw input data and the server side holds the labels, while in the U-shaped (non-label-sharing) configuration both the raw input data and the labels are present on the client side.

We trained the split learning models using two techniques: alternate client (AC) training and alternate mini-batch (AM) training. In alternate client training, the clients train their networks on their entire data sequentially, and the server network, which is common to all clients, is updated sequentially as well. In alternate mini-batch training, a client updates its network on one mini-batch, after which the next client in order takes over. Because the number of data samples can vary across clients, a client that runs out of mini-batches must wait until the next epoch starts, while the remaining clients continue training on their mini-batches sequentially. These sequential updates at the mini-batch level are what distinguish the server-side training in alternate mini-batch training from that in alternate client training.

In the DenseNet experiments, the network was split so that the first 4 layers reside at the client and the rest of the network at the server in the label-sharing configuration; in the non-label-sharing configuration, the last fully connected layer also resides at the client in addition to the first 4 layers. In the U-Net experiments, the network was split so that the first 6 layers reside at the client and the rest of the network at the server in the label-sharing configuration; in the non-label-sharing configuration, the segmentation head (the last 3 layers) also resides at the client in addition to the first 6 layers. We do not use any form of weight synchronization, so all client network segments have unique weights after training. An image from a given data source, whether from the training, validation, or test set, is passed through the client network corresponding to that source; for example, an image from the DT5 data source is forward-propagated through the client network residing on the client that holds the DT5 data.
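As an illustration of the DenseNet-121 split in the label-sharing configuration, the sketch below cuts torchvision's densenet121 after its first four feature layers (the initial convolution, batch norm, ReLU, and pooling). Treating these as the "first 4 layers" is an assumption about the exact cut point, and the classification head shown is a generic binary head rather than the authors' exact one.

import torch
import torch.nn as nn
from torchvision.models import densenet121

full = densenet121(pretrained=False)
layers = list(full.features.children())

client_segment = nn.Sequential(*layers[:4])      # conv0, norm0, relu0, pool0: stays at the hospital
server_segment = nn.Sequential(
    *layers[4:],                                 # dense blocks, transitions, final batch norm
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(full.classifier.in_features, 1),   # single logit for the TB-suspect class
)

x = torch.randn(2, 3, 224, 224)                  # a mini-batch of 224x224 chest X-rays
logits = server_segment(client_segment(x))       # shape: (2, 1)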
We excluded SFLv1 from our experiments due to the unavailability of a supercomputer.

We propose a novel architecture called SplitFedv3, which has the potential to outperform SL and SFLv2. Because a large trainable part of the network resides at the server in SL and SFLv2, "catastrophic forgetting" [17] can occur, where the trained model favors the client data it most recently used for training. In SFLv3 (Algorithm 1), the client-side networks are unique to each client, while the server-side network is an averaged version, as in SplitFedv1. Averaging the server-side network avoids the problem of catastrophic forgetting. In SFLv2 and SFLv3, the split occurs at the same position in the networks as described above in the split learning settings for the DenseNet and U-Net experiments. For the SplitFed models, we used only the alternate client training technique, and we experimented with both the LS and the NLS configurations.

The distributed learning techniques are evaluated on the following metrics: classification performance, training time, data communication, and computation. To set a performance benchmark, we trained a model using the traditional centralized method for both sets of experiments. For classification performance, we use the threshold-independent metrics AUROC (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and AUPRC (https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/) and the threshold-dependent metrics F1-score (https://en.wikipedia.org/wiki/F-score) and kappa (https://en.wikipedia.org/wiki/Cohen%27s_kappa). Elapsed training time, data communication, and computation are valuable metrics for distributed learning methodologies because they indicate the feasibility of using a method in practice; we calculate all three for one epoch of model training.

In this section, the performance and feasibility of the distributed learning methods are evaluated and discussed across these facets. No distributed learning method reaches the benchmark performance of the centralized model in either the DenseNet or the U-Net experiments (see Figures 5-12 and Table 2). For the DenseNet experiments (label sharing, non-label sharing, and alternate client training) and the U-Net experiments (label sharing, alternate client training), SFLv3 performs better than SL and SFLv2.

Elapsed training time is the wall-clock time required to train a model for one epoch. The time taken to train the centralized and the various distributed learning models is shown in Table 3. The SL, SFLv2, and SFLv3 models take almost the same time to train, depending on the configuration (label sharing or non-label sharing). FL models take significantly less time to train than split learning, SFLv2, and SFLv3 in both sets of experiments.

Algorithm 1: SplitFedv3 for the label-sharing configuration. The SplitFed network W is divided into two parts, a client-side part W^C and a server-side part W^S. The learning rate η is the same for the client-side and server-side models. Training of the client-side and server-side models at round t:

procedure MainServerTrain
    C_t: set of n_t clients participating at round t
    for each client i ∈ C_t in parallel do
        W^C_{i,t}: client-side model of client i at round t
        (A_{i,t}, Y_i) ← ClientForwardProp(W^C_{i,t})
        Forward propagation on the server-side model W^S_t with activations A_{i,t} to obtain predictions Ŷ_i
        Loss calculation with Y_i and Ŷ_i        (Y_i: true labels, Ŷ_i: predicted labels)
        Backpropagation on the server-side model W^S_t
        Send dA_{i,t} := ∇ℓ(A_{i,t}; W^S_t) to client i
        ClientBackprop(dA_{i,t})
    end for
    Server-side model update: average the per-client server-side updates to obtain W^S_{t+1}
end procedure

procedure ClientForwardProp(W^C_{i,t})
    for each local epoch from 1 to E do          (E: total number of local epochs at the client)
        for batch b ∈ B do                       (B: set of local data batches)
            Forward propagation on W^C_{i,b,t}
            Concatenate the activations from the final layer of W^C_{i,b,t} to A_{i,t}
            Concatenate the corresponding true labels to Y_i
        end for
    end for
    Send A_{i,t} and Y_i to the main server
end procedure

procedure ClientBackprop(dA_{i,t})
    for batch b ∈ B do
        Calculate the gradients ∇_i(W^C_{i,b,t}) (backpropagation) and update the client-side model W^C_{i,t}
    end for
end procedure
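A compact Python sketch of one SplitFedv3 round under the label-sharing configuration is given below. It reuses the illustrative split_step and fedavg helpers from the earlier sketches, and the per-client deep copy of the server segment is one possible reading of the server-side averaging in Algorithm 1 rather than the authors' exact implementation.

import copy
import torch

def splitfedv3_round(server_net, client_nets, client_loaders, client_sizes, lr=1e-4):
    """One round: each client trains its own segment plus a copy of the server
    segment across the cut; the server copies are then averaged, while the
    client segments remain unique (no client-side synchronization)."""
    server_copies = []
    for name, client_net in client_nets.items():
        srv = copy.deepcopy(server_net)
        opt_c = torch.optim.Adam(client_net.parameters(), lr=lr)
        opt_s = torch.optim.Adam(srv.parameters(), lr=lr)
        for x, y in client_loaders[name]:
            split_step(client_net, srv, opt_c, opt_s, x, y)   # see the split learning sketch above
        server_copies.append(srv.state_dict())
    # Server-side segments are averaged; client-side segments stay unique.
    server_net.load_state_dict(fedavg(server_copies, client_sizes))  # see the FedAvg sketch above
    return server_net, client_nets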
The amount of back-and-forth data communication between the server and all clients is shown in Table 4. One epoch consists of training a model on the training data and validating it on the validation data in order to save the weights. The data communication in federated learning consists of sending the model back and forth between the server and the clients, whereas the data communication for the SL models consists of transferring activations and gradients in training mode and activations only in evaluation (validation) mode. More data is transferred in the non-label-sharing configuration of SL than in the label-sharing configuration. SFLv2 has the additional overhead of sending the client network models back and forth before and after averaging; however, the client model segments are small (on the order of bytes) and have no significant effect on data communication for either the DenseNets or the U-Nets. In SFLv3, the server model segment needs to be averaged, but because it resides on the server, no transfer of the server model segment is required. The amount of data transferred in SL, SFLv2, and SFLv3 is enormous; unless a strong network with high bandwidth is available, these methods appear infeasible in practice. The data transfer in federated learning is low, which makes it suitable for practical settings.

The computations that occur at the server (server FLOPs) and at the clients (client FLOPs) are in the range of teraFLOPs. Because each client has a different number of data samples, each client performs a different number of computations; we therefore average the computations over all clients and call this measure the average client FLOPs, which is also in the range of teraFLOPs. In federated learning, SFLv2, and SFLv3, the server additionally needs to average models, so we include the averaging-model FLOPs as an extra parameter for comparison; these are in the range of megaFLOPs. Since an additional part of the network resides on the client in the non-label-sharing configuration, the server performs fewer computations in that configuration than in the label-sharing configuration.
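To make the scale of these differences concrete, the following back-of-envelope sketch estimates per-epoch traffic for FL and label-sharing SL; the parameter count, cut-layer activation size, and dataset sizes plugged in at the bottom are illustrative assumptions, not the measured values reported in Table 4.

BYTES_PER_FLOAT32 = 4

def fl_traffic_bytes(model_params, n_clients):
    # One federated round: global model down to each client, local update back up.
    return 2 * model_params * BYTES_PER_FLOAT32 * n_clients

def sl_traffic_bytes(cut_activation_elems, n_train, n_val):
    # Training: activations up and gradients down per sample; validation: activations up only.
    per_sample = cut_activation_elems * BYTES_PER_FLOAT32
    return (2 * n_train + n_val) * per_sample

# Illustrative numbers: DenseNet-121 (~8 million parameters) cut after its stem,
# where a 224x224 input yields 56x56x64 activations, with assumed dataset sizes.
print(fl_traffic_bytes(8_000_000, n_clients=5) / 1e9, "GB per epoch (FL)")
print(sl_traffic_bytes(56 * 56 * 64, n_train=20_000, n_val=2_000) / 1e9, "GB per epoch (SL)")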
Our comparative study demonstrated the cost and feasibility of using distributed learning methods in practice. The proposed distributed learning architecture, SplitFedv3, performs better than SL and SplitFedv2 on the four performance metrics (AUROC, AUPRC, F1-score, and kappa). Moreover, the new alternate mini-batch training technique improves the performance of SL models. Beyond classification performance, metrics such as training time, data communication, and computational cost play a vital role in deciding whether a particular distributed deep learning method is feasible in practical settings.

The SL, SplitFedv2, and SplitFedv3 models take more time to train than the FL model and require more data communication; they would need a high-speed, high-bandwidth network to train in a practical setting. The FL model, however, has higher computational costs at the clients: to train an FL model, clients require substantial computational resources, and unless they have access to GPUs, the FL method would take a long time to complete its computations. In contrast, the clients in SL, SplitFedv2, and SplitFedv3 carry out only a small number of computations and could do so even without access to GPUs. Taking all the metrics into account, namely performance, elapsed training time, data communication, and computation, FL is the best distributed learning method, provided the clients have adequate computing power.

References

Federated optimization: Distributed machine learning for on-device intelligence
Federated learning: Collaborative machine learning without centralized training data
Communication-efficient learning of deep networks from decentralized data
Federated learning for breast density classification: A real-world implementation
Distributed learning of deep neural network over multiple agents
Splitfed: When federated learning meets split learning
Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation
Privacy-preserving federated brain tumour segmentation
Split learning for collaborative deep learning in healthcare
Experiments of federated learning for covid-19 chest x-ray images
Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs
Padchest: A large chest x-ray image dataset with multi-label annotated reports
A generic framework for privacy preserving deep learning
Densely connected convolutional networks
U-net: Convolutional networks for biomedical image segmentation
Adam: A method for stochastic optimization
Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data
Automated pancreas segmentation using multi-institutional collaborative deep learning
Detailed comparison of communication efficiency of split learning and federated learning
Split learning for health: Distributed deep learning without sharing raw patient data
Federated learning: A survey on enabling technologies, protocols, and applications
Advances and open problems in federated learning
Towards federated learning at scale: System design
The future of digital health with federated learning
Federated optimization: Distributed optimization beyond the datacenter
End-to-end evaluation of federated learning and split learning for internet of things
Reducing leakage in distributed deep learning for sensitive health data
Can we use split learning on 1d cnn models for privacy preserving training?
The algorithmic foundations of differential privacy
Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: Abide results
No peek: A survey of private distributed deep learning