key: cord-0138695-u8w7mqp0 authors: Wibawa, Febrianti; Catak, Ferhat Ozgur; Sarp, Salih; Kuzlu, Murat; Cali, Umit title: Homomorphic Encryption and Federated Learning based Privacy-Preserving CNN Training: COVID-19 Detection Use-Case date: 2022-04-16 journal: nan DOI: nan sha: 0347a2963ac6884f8e9680ab7d7cafa70e624a6f doc_id: 138695 cord_uid: u8w7mqp0

Medical data is often highly sensitive in terms of data privacy and security. Federated learning, a machine learning technique, has begun to be used to improve the privacy and security of medical data. In federated learning, the training data is distributed across multiple machines, and the learning process is performed collaboratively. However, there are several privacy attacks on deep learning (DL) models that allow attackers to obtain sensitive information. Therefore, the DL model itself should be protected from adversarial attacks, especially in applications using medical data. One solution to this problem is homomorphic encryption-based model protection from an adversarial collaborator. This paper proposes a privacy-preserving federated learning algorithm for medical data using homomorphic encryption. The proposed algorithm uses a secure multi-party computation protocol to protect the deep learning model from adversaries. The proposed algorithm is evaluated on a real-world medical dataset in terms of model performance.

Machine learning (ML) is a widely used technique in almost all fields, in which a computer system learns from data to improve its performance. It is applied in many areas such as image recognition, natural language processing, and machine translation. Federated learning is a machine learning technique in which the training data is distributed across multiple machines and the learning process is performed collaboratively [13].
This technique can be used to improve the privacy and security of medical data [10]. Medical data is often highly sensitive and subject to data privacy and security concerns [1]. For example, a person's health information is often confidential and can be used to identify the person; thus, it is essential to protect the privacy and security of medical data. The Health Insurance Portability and Accountability Act (HIPAA) (US Department of Health and Human Services, 2014) and the General Data Protection Regulation (GDPR) (The European Union, 2018) strictly mandate the protection of personal health information. There are various methods to safeguard private information. Federated learning is one technique that can be utilized to protect sensitive data during multi-party computation tasks: it improves the privacy and security of medical data by preventing the data from being centralized and thus vulnerable. However, keeping the data local is not sufficient to secure the data and the ML model, since there are several privacy attacks on deep learning models that extract private data [9, 25]. For example, attackers can use the gradient information of a deep learning model to obtain sensitive information. Thus, the deep learning model itself should be protected from adversaries as well. One solution to this problem is homomorphic encryption-based model protection from an adversarial collaborator. Homomorphic encryption is a technique in which data can be encrypted and operations can be performed on the encrypted data [4]; it can be used to protect the deep learning model from adversaries. This paper proposes a privacy-preserving federated learning algorithm based on a convolutional neural network (CNN) for medical data using homomorphic encryption.

arXiv:2204.07752v1 [cs.CR] 16 Apr 2022
The proposed algorithm uses a secure multi-party computation protocol to protect the deep learning model from adversaries. We evaluate the proposed algorithm on a real-world medical dataset and show that it can protect the deep learning model from adversaries.

Data-driven ML models provide unprecedented opportunities for healthcare through the use of sensitive health data. These models are trained locally to protect the sensitive health data. However, it is difficult to build robust models without diverse and large datasets covering the full spectrum of health concerns. Prior work to overcome this problem includes federated learning techniques. For instance, the studies [5, 17, 24] reviewed current applications and technical considerations of federated learning for preserving sensitive biomedical data. The impact of federated learning is examined through stakeholders such as patients, clinicians, healthcare facilities, and manufacturers. In another study, the authors in [16] utilized federated learning systems for brain tumour segmentation on the BraTS dataset, which consists of magnetic resonance imaging brain scans. The results show that performance is decreased by the cost of privacy protection. The same BraTS dataset is used in [19] to compare three collaborative training techniques, i.e., federated learning, institutional incremental learning (IIL), and cyclic institutional incremental learning (CIIL). In IIL and CIIL, institutions train a shared model successively, where CIIL adds a cycling loop through the organisations. The results indicate that federated learning achieves Dice scores similar to those of models trained by sharing data, and it outperforms the IIL and CIIL methods, which suffer from catastrophic forgetting and complexity. Medical data is also safeguarded by encryption techniques such as homomorphic encryption.
In [15], the authors propose an online secure multi-party computation scheme that shares patient information with hospitals using homomorphic encryption. Bocu et al. [7] proposed a homomorphic encryption model integrated into a personal health information system utilizing heart rate data. The results indicate that the described technique successfully addressed the secure data processing requirements for 500 patients, with the expected storage and network challenges. In another study, Wang et al. [23] proposed a data division scheme based on homomorphic encryption for wireless sensor networks; the results show a trade-off between resources and data security. In [14], the applicability of homomorphic encryption is shown by measuring patients' vitals with a lightweight encryption scheme. Sensor data such as respiration and heart rate are encrypted using homomorphic encryption before being transmitted to the non-trusted third party, while encryption takes place only in the medical facility. The study in [20] developed an IoT-based architecture with homomorphic encryption to combat data loss and spoofing attacks in chronic disease monitoring; the results suggest that homomorphic encryption provides cost-effective and straightforward protection of sensitive health information. Blockchain technologies are also utilized together with homomorphic encryption for the security of medical data. The authors in [21] proposed practical pandemic infection tracking using homomorphic encryption and blockchain technologies in intelligent transportation systems with automatic healthcare monitoring. In another study, Ali et al. [3] developed a searchable distributed medical database on a blockchain using homomorphic encryption. The increasing need to secure sensitive information leads to the combined use of various techniques. In the scope of this study, a multi-party computation tool using federated learning with homomorphic encryption is developed and analyzed.
Nowadays, data encryption is a common practice not only for enterprises but also for individuals; it is meant to protect the privacy of the data. Data encryption is mostly done at rest, when the data is stored, and in transit, when the data is transferred. However, data is rarely kept encrypted while operations or computations are executed on it. Homomorphic encryption is an encryption method that allows arithmetic computations to be performed directly on encrypted (ciphered) text without requiring any decryption. The outputs of the computations are also in encrypted form and yield identical, or almost identical, results when decrypted. This means that homomorphic encryption allows data processing without disclosing the actual data. If E denotes encryption, D denotes decryption, and f(·) is a function applied on the actual values (plaintexts) m1 and m2, using encryption key k, then the homomorphic encryption property is:

D(f(E(m1, k), E(m2, k)), k) = f(m1, m2)

Homomorphic encryption can be used for privacy-preserving outsourced storage and computation. This allows data to be encrypted and outsourced to commercial cloud environments for processing, all while encrypted. There are several types of homomorphic encryption [2]: (1) partially homomorphic encryption supports only one homomorphic operation, either addition or multiplication, an unlimited number of times; (2) somewhat homomorphic encryption (SHE) supports both addition and multiplication, but only up to a limited number of operations; (3) fully homomorphic encryption (FHE) supports arbitrary computation on ciphertexts. SHE is used in this work since it allows both the addition and multiplication operations on encrypted data that are required for aggregating machine learning model weights. The BFV scheme is a well-known homomorphic encryption scheme. It encrypts polynomials instead of individual bits, and the encrypted polynomials can be evaluated homomorphically. It is secure in the IND-CPA sense, with security based on the hardness of the ring learning with errors (RLWE) problem. We now briefly describe the BFV scheme.
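To make the homomorphic property above concrete, the sketch below implements textbook Paillier encryption, a partially (additively) homomorphic scheme in which multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This is an illustrative stand-in chosen for brevity, not the BFV scheme used in this work, and the tiny hard-coded primes make it completely insecure.

```python
import math
import random

def lcm(a, b):
    return a * b // math.gcd(a, b)

def paillier_keygen(p=293, q=433):
    # Tiny primes for illustration only -- utterly insecure in practice.
    n = p * q
    g = n + 1                       # standard simple choice of generator
    lam = lcm(p - 1, q - 1)
    # mu = (L(g^lam mod n^2))^{-1} mod n, where L(x) = (x - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)        # public key, secret key

def encrypt(pk, m):
    n, g = pk
    while True:                     # pick r coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n
```

Here D(E(m1, k) · E(m2, k) mod n^2, k) = m1 + m2 mod n, i.e., the homomorphic property above with f as addition on ciphertexts and multiplication in the encrypted domain.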
Let n be a power of two and let R = Z[x]/(x^n + 1) be the corresponding polynomial ring. The scheme uses a plaintext modulus t and a much larger ciphertext modulus q: plaintexts are elements of R_t (polynomials with coefficients modulo t), and ciphertexts are pairs of elements of R_q. The secret key s is a small random polynomial in R. The public key is the pair (b, a), where a is uniformly random in R_q, e is a small noise polynomial, and b = -(a·s + e) mod q. Decryption of a ciphertext (c0, c1) computes c0 + c1·s, scales the result by t/q, and rounds to recover the plaintext polynomial. For batched (SIMD) encoding, a vector of integers modulo t is encoded into a plaintext polynomial by interpolation at the roots of x^n + 1 modulo t, and decoding evaluates the polynomial at those points.

Trust and privacy are among the fundamental elements of digital healthcare systems and platforms. Trust is expected to be built between the various stakeholders of the digital healthcare ecosystem, such as patients, medical care providers, health authorities, and healthcare system providers. The following medical data are among the most critical in terms of privacy and must be protected:

• Personal information related to the patient, such as address, social security number, birth date, and bank account number;
• Provided medical and psychological services, drugs, equipment, and procedures;
• The status of the patient's medical or psychological conditions;
• Information related to the hospital, clinic, or medical professionals who provided the medical and psychological services.

The European General Data Protection Regulation (GDPR) is among the most widely applied regulatory frameworks for data privacy and concentrates on individual control by data subjects over 'their' data. Public and private healthcare data privacy is handled under GDPR regulations [22]. The Brakerski/Fan-Vercauteren (BFV) scheme [8, 11, 12] incorporates powerful Single Instruction Multiple Data (SIMD) parallelism, making it ideal for applications that handle massive volumes of data. In this crypto scheme, the messages are vectors of integers, m ∈ Z^n_t, encoded into plaintext polynomials of degree less than n.
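As a compact reference, the textbook BFV operations can be written as follows (standard notation from the BFV literature; the symbols Δ, u, e1, and e2 are not defined elsewhere in this text):

```latex
\begin{aligned}
\mathrm{pk} &= (b,\, a), \qquad b = -(a \cdot s + e) \bmod q,\\[2pt]
\mathrm{Enc}(m) &= (c_0,\, c_1) = \big(\Delta\, m + b\, u + e_1,\;\; a\, u + e_2\big) \bmod q,
  \qquad \Delta = \lfloor q/t \rfloor,\\[2pt]
\mathrm{Dec}(c_0, c_1) &= \Big\lfloor \tfrac{t}{q}\,\big[\, c_0 + c_1 \cdot s \,\big]_q \Big\rceil \bmod t,
\end{aligned}
```

where u, e, e1, and e2 are small random polynomials and [·]_q denotes reduction modulo q. Decryption succeeds as long as the accumulated noise stays below Δ/2, which is why the number of homomorphic operations in a somewhat homomorphic setting is limited.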
Federated learning is a machine learning technique that enables multiple parties to build and train a common machine learning model without exchanging or sharing data. Each party (client) stores and processes its own local dataset, while a common model is shared with all clients. Each client trains the common model on its local dataset and sends the trained model to a centralized server. The server then aggregates the models received from all the clients and distributes the aggregated model back to them. Federated learning addresses data security and privacy issues since it requires neither access to each client's dataset nor the sharing of that dataset. The local datasets do not have to be identically distributed and can be heterogeneous. These properties make federated learning popular in healthcare applications: it enables health institutions to form and train a common model without transferring sensitive patient data out. There are several types of federated learning settings [6]: (1) Centralized federated learning, in which a central server collects and aggregates the models from participating clients during the learning process, and a global common model is pushed from the server down to the clients. (2) Decentralized federated learning, in which participating clients coordinate among themselves to obtain a global common model [18]. (3) Heterogeneous federated learning, in which participating clients come from different technical platforms, e.g., PCs and mobile phones, each with its own local dataset and model, while obtaining a single global model. In this work, the centralized federated learning setting is implemented to demonstrate model aggregation by a single centralized server. This section gives a high-level system overview of the proposed BFV crypto-scheme-based privacy-preserving federated learning COVID-19 detection training method.
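The centralized train-aggregate-redistribute loop described above can be sketched in a few lines. This is a hedged toy illustration: the names local_train and aggregate are ours, and a one-weight least-squares update stands in for the paper's CNN training.

```python
def local_train(weights, local_data, lr=0.1):
    # Placeholder local update: one gradient step of a 1-D least-squares fit
    # (stands in for each client's CNN training on its local dataset).
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return [w - lr * grad]

def aggregate(client_weights):
    # Server side: element-wise average of the clients' weight vectors.
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# One federated setup: three clients hold disjoint local datasets that all
# follow y = 2x; no client ever shares its samples with the server.
clients = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]
global_w = [0.0]
for _ in range(50):                     # 50 federated rounds
    updates = [local_train(global_w, data) for data in clients]
    global_w = aggregate(updates)       # redistribute the averaged model
```

After enough rounds the averaged global weight converges to the value implied by the union of the clients' data (here, 2.0), even though training data never leaves a client.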
The proposed privacy-preserving scheme is a two-phase approach: (1) local model training at each client and (2) encrypted model-weight aggregation at the server. In the local model training phase, each client builds its local CNN-based DL model using its local electronic health record dataset. The clients encrypt the model weight matrix using the public key. In the second phase, the server aggregates all clients' encrypted weight matrices and sends the final matrix back to the clients. Each client decrypts the aggregated encrypted weight matrix to update the weights of its DL model. Figure 1 shows the system overview, and Figure 2 shows the CNN-based COVID-19 detection model used in the experiments.

Algorithm 1 shows the overall process in the initialization phase. Each client i trains the local classifier, h_i, with its private dataset, D_i. The trained model's weight matrix, W_i, is encrypted, ⟦W_i⟧, and shared with the server.

Algorithm 1 (client-side training and encryption):
Require: the dataset at client i: D_i = {(x_j, y_j) | x_j ∈ R^d, y_j ∈ R}, public key pk
1: train the local classifier h_i on D_i
2: ⟦W_i⟧ ← ∅ // create an empty matrix for the encrypted layer weights
3: for each layer l ∈ h_i do
4: ⟦W_i[l]⟧ ← Enc(pk, W_i[l]) // encrypt the layer weights with the public key
5: end for
6: return ⟦W_i⟧ // the encrypted weight matrix

The server collects all encrypted weight matrices, {⟦W⟧_0, ..., ⟦W⟧_c}, from the clients and calculates the average value of each neuron's weight in the encrypted domain; Algorithm 2 shows the overall process in the aggregation phase. The last step is client decryption, in which each client decrypts the aggregated encrypted weight matrix, ⟦W⟧_avg, and updates its local model, h; Algorithm 3 shows the overall process in the client decryption phase. We implemented the proposed protocols and the classifier training phase in Python, using the Keras/TensorFlow libraries for model building and the Microsoft SEAL library for the somewhat homomorphic encryption implementation.
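The three phases (client-side encryption, server-side aggregation in the encrypted domain, client-side decryption) can be illustrated end to end. In the sketch below, the BFV/SEAL ciphertexts are replaced by a one-time-pad-style additive masking, which is additively homomorphic in the way this protocol requires but offers none of BFV's other properties; the fixed-point SCALE mirrors how integer-message HE schemes handle floating-point weights. All names here are illustrative, not from the paper's implementation.

```python
import random

Q = 2**61 - 1   # modulus of the toy "ciphertext" space
SCALE = 10**6   # fixed-point scale: integer-message HE cannot encrypt floats directly

def encode(w):
    return int(round(w * SCALE)) % Q

def decode(v, n_clients):
    if v > Q // 2:                      # map back to a signed value
        v -= Q
    return v / (SCALE * n_clients)      # undo scaling and averaging in one step

def encrypt(weights, masks):
    # Client side: additive one-time-pad masking stands in for BFV encryption.
    return [(encode(w) + m) % Q for w, m in zip(weights, masks)]

def aggregate(encrypted):
    # Server side: element-wise sum entirely in the "encrypted" domain.
    return [sum(col) % Q for col in zip(*encrypted)]

def decrypt(agg, mask_col_sums, n_clients):
    # Client side: remove the combined masks, then decode the averaged weights.
    return [decode((c - m) % Q, n_clients) for c, m in zip(agg, mask_col_sums)]

# Three clients, each holding a two-weight local model.
client_weights = [[0.5, -1.25], [0.7, -1.05], [0.6, -1.0]]
masks = [[random.randrange(Q) for _ in w] for w in client_weights]
ciphertexts = [encrypt(w, m) for w, m in zip(client_weights, masks)]
aggregated = aggregate(ciphertexts)                       # server never sees weights
mask_col_sums = [sum(col) % Q for col in zip(*masks)]
averaged = decrypt(aggregated, mask_col_sums, len(client_weights))
```

The averaged result equals the plain-domain mean of the clients' weights, while the server only ever handles masked values; in the paper's scheme, BFV ciphertext addition plays the role of the modular sums here.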
To show the training-phase time performance of the proposed protocols, we tested a public COVID-19 x-ray scan dataset with different numbers of clients and ciphertext modulus settings, s ∈ {128, 192}, which determine how much noise can accumulate before decryption fails. Table 1 shows the dataset details, and samples of the dataset are depicted in Figure 3. The dataset is arbitrarily partitioned among the clients (c ∈ {2, 3, 5, 7}), and the prediction performance in the encrypted domain is compared with the results in the plain domain. Table 2 shows the best performance of the conventional CNN method on the COVID-19 x-ray scan dataset. Table 3 shows the prediction performance of the CNN-based classification model with and without encryption. As shown in the table, the overall performance remains stable as the number of clients varies from 2 to 7. Figure 4 shows the execution times in seconds for three different configurations (i.e., plain, s=128, s=192). As expected, execution in the encrypted domain takes much longer than in the plain domain. The experimental results in Figure 4 provide new insights into the relationship between the number of clients and the execution time. There is a significant difference in execution time between the plain (unencrypted) and encrypted processes; this large gap is due to the complexity of homomorphic encryption and of processing encrypted data. The execution times for the two ciphertext modulus values (128, 192) are indistinguishable for two clients, but the variation grows with the number of clients. There is thus an anticipated trade-off between execution time and the security level of the models. For the prediction phase, the test performances of the encrypted and unencrypted processes are very similar, as indicated in Table 3; in fact, similar performances are achieved by each model as the number of clients increases.
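The arbitrary partitioning of the dataset among c clients can be sketched as follows. This is a hedged illustration: the paper does not give its exact splitting procedure, and partition is our name.

```python
import random

def partition(samples, n_clients, seed=0):
    # Shuffle a copy of the sample list, then deal it round-robin so every
    # client receives a disjoint, roughly equal-sized local dataset.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::n_clients] for i in range(n_clients)]
```

For example, partition(range(10), 3) yields three disjoint shards of sizes 4, 3, and 3; with c ∈ {2, 3, 5, 7} the same call produces the per-client splits used in each experiment.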
Moreover, in some cases, the results with plain data perform slightly better than those of the privacy-preserving counterpart. Privacy preservation has become an essential practice for healthcare institutions, as it is mandated by both EU and US regulations. Federated learning and homomorphic encryption will play a critical role in maintaining data security during model training. By benefiting from both techniques, the proposed model achieves competitive performance, although there is a significant trade-off between execution time and the number of clients. The classification metrics, i.e., accuracy, F1 score, precision, and recall, reach over 80% using both encrypted and plain data in each federated learning case. Privacy attacks can cause immense damage to the security and privacy of patient information and thereby hinder advances in healthcare based on data-driven models. Therefore, it is indispensable to take imperative steps to strengthen not only the safety of the information but also the way the data is processed. This study demonstrated that federated learning with homomorphic encryption can be successfully applied to enhance data-driven models by minimizing or eliminating the sharing of sensitive data. It is envisioned that this study could be useful for scientists and researchers working on sensitive healthcare data in multi-party computation settings.

References
• Big data security and privacy in healthcare: A review.
• A survey on homomorphic encryption schemes: Theory and implementation.
• Anca Delia Jurcut, and Mohammed A. Alzain. 2022. Deep learning based homomorphic secure searchable encryption for keyword search in blockchain healthcare system: A novel approach to cryptography.
• A systematic review on the status and progress of homomorphic encryption technologies.
• Imrana Abdullahi Yari, and Björn Eskofier. 2022. Federated learning for healthcare: Systematic review and architecture proposal.
• Decentralized federated learning for extended sensing in 6G connected vehicles.
• A homomorphic encryption-based system for securely managing personal health metrics data.
• Fully homomorphic encryption without modulus switching from classical GapSVP.
• Practical implementation of privacy preserving clustering methods using a partially homomorphic encryption algorithm.
• Secure multi-party computation based privacy preserving data analysis in healthcare IoT systems.
• Somewhat practical fully homomorphic encryption.
• Pyfhel: Python for homomorphic encryption libraries.
• Applied homomorphic cryptography.
• Advances and open problems in federated learning.
• Amna Eleyan, and Ahcène Bounceur. 2021. A fully homomorphic encryption based on magic number fragmentation and El-Gamal encryption: Smart healthcare use case.
• Secure multiparty computation enabled e-healthcare system with homomorphic encryption.
• Privacy-preserving federated brain tumour segmentation.
• The future of digital health with federated learning.
• BrainTorrent: A peer-to-peer environment for decentralized federated learning.
• Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation.
• Shared-node IoT network architecture with ubiquitous homomorphic encryption for healthcare monitoring.
• Practical homomorphic authentication in cloud-assisted VANETs with blockchain-based healthcare monitoring for pandemic control.
• Observational health research in Europe: understanding the General Data Protection Regulation and underlying debate.
• Data division scheme based on homomorphic encryption in WSNs for health care.
• Federated learning for healthcare informatics.
• CPP-ELM: Cryptographically privacy-preserving extreme learning machine for cloud systems.