key: cord-0582418-1gtf0630 authors: Ulhaq, Anwaar; Burmeister, Oliver title: COVID-19 Imaging Data Privacy by Federated Learning Design: A Theoretical Framework date: 2020-10-13 journal: nan DOI: nan sha: 1680cfde6379eb11e0156e6454bd81866db03c1c doc_id: 582418 cord_uid: 1gtf0630

To address COVID-19 healthcare challenges, we need frequent sharing of health data, knowledge and resources at a global scale. However, in this digital age, data privacy is a major concern that requires the secure embedding of privacy assurance into the design of all technological solutions that use health data. In this paper, we introduce the differential privacy by design (dPbD) framework and discuss its embedding into federated machine learning systems. To limit the scope of our paper, we focus on the problem scenario of COVID-19 imaging data privacy for disease diagnosis by computer vision and deep learning approaches. We discuss the evaluation of the proposed design of federated machine learning systems and show how the differential privacy by design (dPbD) framework can enhance data privacy in federated learning systems with scalability and robustness. We argue that scalable differentially private federated learning design is a promising solution for building the secure, private and collaborative machine learning models required to combat the COVID-19 challenge.

The COVID-19 pandemic has changed our world and its challenges [1]. Today, more than ever, we require a centralised platform and a collective approach to facilitate collaborative research efforts across scientific disciplines [2, 3]. Artificial intelligence, especially computer vision, has responded strongly to this challenge [4]. Various imaging modalities are being processed and analysed for COVID-19 control [5, 6]. These approaches range from disease diagnosis and prognosis to disease prevention and management, based on imaging modalities such as digital chest x-ray radiography (CXR), chest computed tomography (CT) and lung ultrasound (LUS) [7, 8, 9]. An ethical study of imaging data requires privacy, confidentiality and integrity throughout data analysis. However, this is a formidable task because of the existing vulnerabilities of traditional machine learning systems, which rely heavily on shared datasets for their training. Similarly, new data protection regulations such as the General Data Protection Regulation (GDPR) [10] in the European Union restrict the movement of data outside their regional territories. This situation requires a fundamental change in the ways machine learning systems can work collaboratively. The privacy by design (PbD) approach [11] ensures that privacy assurance is embedded into a system's design lifecycle by default, from beginning to end. Any system built without privacy considerations as a core part of its design process therefore often exhibits poor privacy control. A post-GDPR world requires a focus on privacy by design, with data privacy at the core of system design [12]. PbD has a special place in the design of machine learning systems, especially the deep learning systems that are going to shape our digital future. In 2016, Google introduced a collaborative machine learning system with embedded PbD, named federated learning, that removed the need for centralised training data [13, 14, 15].
The fundamental idea is that different clients or nodes train local machine learning models on local data samples without exchanging the data itself; only model parameters (e.g. the weights and biases) are shared between these local nodes at some frequency to generate a global model through a process called federated averaging. All clients or nodes then share this global model. As no data sharing takes place between the nodes, the simplicity of the federated design ensures data privacy. Such systems are ideally suited for training machine learning algorithms for COVID-19 control, as sensitive COVID-19 health data will not be shared and will remain in the custody of their subjects. Figure 1 illustrates the privacy by design approach to machine learning with a federated model design for COVID-19 detection using CXR. However, this process alone does not guarantee data protection and system robustness: attackers can infer individual data from the global model [16, 17]. Differential privacy adds random noise to an individual's model, obscuring the results [18, 19]. This noise can be added before the model is shared with the server, without revealing the actual data, and so preserves the individual's privacy. However, this design approach has limited scalability and robustness, and the integration is hard for system designers to grasp because of its complexity. Inspired by the preliminary empirical work on differentially private federated learning, in this paper we introduce a theoretical framework called differential privacy by design (dPbD) that can help to design scalable and robust federated learning systems for COVID-19 data privacy. We address the following research questions in this paper:
• How can we devise a theoretical framework that underpins all salient and impeding factors that affect the design of differentially private federated learning systems with scalability and robustness?
• How can differentially private federated learning systems ensure the privacy of COVID-19 imaging data without adversely harming accuracy?
The contributions of this paper are as follows: We extend the privacy by design framework in the context of differential privacy and federated learning. We propose a theoretical framework called differential privacy by design (dPbD) that underpins the salient and impeding factors that affect the design of differentially private federated learning systems with scalability and robustness. We discuss how the proposed framework can be used to model the privacy of COVID-19 imaging data for training differentially private federated learning systems. We have organised the paper as follows: Section 2 describes the related work. A brief introduction to the preliminary techniques that have inspired the development of this theoretical framework is given in Section 3. The proposed framework is presented in Section 4. We discuss seven principles to implement this design in Section 5. The embedding of the proposed principles into federated learning system design for COVID-19 data is discussed in Section 6. This is followed by a discussion and concluding remarks. At an early stage of COVID-19, authorities recognised the importance of privacy and public trust in the fight against the pandemic. Some data privacy authorities, such as the European Data Protection Board, treated it as an exceptional scenario and highlighted Article 9 of the General Data Protection Regulation.
This article allows the processing of personal data "for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health" [2]. However, careful data-management practices in our data-intensive world require giving critical consideration to data privacy. COVID-19 poses various challenges to Biomedical Imaging (BI), Artificial Intelligence (AI), Data Analytics (DA), Computer Vision (CV) and Machine Learning (ML), as these disciplines require adequate access to big data to detect, diagnose, and predict the spread of the infection [5, 8, 9]. This article restricts its scope to the data privacy of COVID-19 medical imaging data required by computer vision and machine learning approaches for COVID-19 control. Privacy by design is an approach to embedding privacy directly into system design, introduced by Ann Cavoukian in 2009 [11, 20]. Its incorporation into the European GDPR shows its importance [10]. Privacy-preserving machine learning uses the privacy by design approach in its framework [21]. Most recently, Google introduced a federated machine learning framework for preserving the privacy of training data [13, 14]. Federated learning systems use a distributed framework in which model training is performed locally at the client side and the updated model parameters are then sent to a central server for aggregation. In contrast to the idea of data fusion [22], a global model aggregation or fusion is used. Several works [17, 15] discuss the design, challenges and future directions of federated learning; we refer interested readers to these articles. Federated learning systems inherently support privacy by design. However, PbD has been criticised for being unclear, complex to enforce in implementation, and difficult to adapt to certain disciplines [12]. Another line of work in the privacy domain takes great interest in leveraging differential privacy in machine learning design for privacy-preserving AI; we refer the interested reader to the review articles [18, 19]. With the increased popularity and enhanced performance of deep learning, differential privacy has been proposed for deep learning models and federated learning [23, 24, 25]. The integration of differential privacy and federated learning is leading future research directions in privacy-preserving AI. However, this integration is highly complex and lacks clarity in its design due to various competing factors. The complexity of the preliminary work in this direction inspired us to propose a theoretical framework called differential privacy by design, specially intended to provide differential privacy to COVID-19 imaging data using a federated machine learning system. 3 Preliminaries: Federated learning is a powerful framework devised to let machine learning scientists work collaboratively on decentralised data with a privacy-by-default setting. Google generated the initial idea as part of a series of works in 2015 and 2016 [13, 14]. The initial focus of the federated design was on on-device federated learning tailored to distributed mobile-user interactions. An example is the Gboard app, which predicts the next word to make typing effortless based on typed text, using a federated recurrent neural network (RNN) model [26].
Consider a scenario of $n$ machine learning scientists $\{s_1, \ldots, s_n\}$ working on COVID-19 imaging data, all of whom wish to collaborate to train a machine learning model by sharing their respective imaging data $\{d_1, \ldots, d_n\}$. A federated learning system defines a distributed machine learning process in which the ML scientists collaboratively train a shared model $M_f$. However, this process ensures that no ML scientist $s_i$ exposes its data $d_i$ to any other scientist $s_j$ during the local training process. Each of the data scientists, however, shares its local model $m_i$ with the server (centralised architecture) [27] or on a blockchain (decentralised architecture) [25]. Federated averaging is used for model aggregation to obtain a shared model $M_f$, which is then sent back to each ML scientist. Various improvements to the initial design [13, 14] have been proposed in the literature, including horizontal and vertical federated learning architectures, federated transfer learning, federated domain adaptation, federated adversarial learning, improved communication protocols and security, and making federated learning more personalisable. It is emerging as a promising research topic in machine learning; interested readers may refer to the topic reviews [17, 15]. Differential privacy (DP) defines a formal assurance of anonymity and indistinguishability in terms of a privacy budget $\epsilon$: the smaller the budget, the stronger the confidence in privacy [18, 19]. The topic has its roots in aggregate or adjacent databases. In the case of COVID-19 imaging, each training dataset is a set of image-label pairs for supervised learning. Any two of these datasets are adjacent if they differ in a single entry, i.e. one image-label pair is present in one dataset and absent in the other. In other words, two databases $X$ and $Y$ are neighbours or adjacent if $A(X, Y) = 1$, where $A$ is the distance. A randomised mechanism $K: D \rightarrow R$ with domain $D$ and range $R$ preserves $(\epsilon, \delta)$-differential privacy if, for any pair of adjacent databases $(X, Y)$ belonging to $D$ and any set $S$ of possible outputs, $Pr[K(X) \in S] \leq e^{\epsilon} \, Pr[K(Y) \in S] + \delta$. A randomised mechanism such as the Gaussian mechanism (GM) [28] approximates a real-valued function $F$ on these neighbouring datasets, and differential privacy can be enforced by adding noise to the model at a scale that depends on the sensitivity of the function $F$. The global sensitivity $GS$ of a function $F$ is defined as the maximum change of $F$ over adjacent databases, $GS(F) = \max_{A(X,Y)=1} \| F(X) - F(Y) \|$. DP provides a mathematically provable guarantee of privacy protection against a wide range of privacy attacks (including differencing attacks, linkage attacks, and reconstruction attacks). Decreasing $\epsilon$ leads to a decrease in accuracy; $\epsilon$ is a metric of the privacy loss under a differential change in the data (adding or removing one entry). The smaller its value, the better the privacy protection, while accuracy is defined as the closeness of the output of the DP algorithm to the pure (noise-free) output. $F(X)$ can be released accurately when $F$ is insensitive to individual models [29]. Many natural functions, such as the sample mean and the covariance matrix, have low $GS$. To achieve a small global sensitivity, the ideal condition is that all clients use sufficient local datasets for training [24]. In privacy-preserving machine learning, we search for an algorithm that takes as input a dataset (sampled from some distribution) and then privately outputs a hypothesis $h$ that, with high probability, has low error over the distribution.
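To make these preliminaries concrete, the following minimal sketch (in Python with NumPy; the function names are illustrative and not taken from any particular library) shows how a clipping step can bound the sensitivity of a local model update and how the Gaussian mechanism can then add calibrated noise so that the released update satisfies $(\epsilon, \delta)$-differential privacy.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale the update so its L2 norm is at most clip_norm; this bounds
    the global sensitivity of the quantity that will be released."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def gaussian_mechanism(update, clip_norm, epsilon, delta):
    """Add Gaussian noise calibrated to the clipped sensitivity so that
    releasing the result satisfies (epsilon, delta)-differential privacy.
    The noise scale follows the classical bound
    sigma = clip_norm * sqrt(2 * ln(1.25 / delta)) / epsilon (for epsilon < 1)."""
    clipped = clip_update(update, clip_norm)
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped + np.random.normal(0.0, sigma, size=clipped.shape)

# Example: privatise a flattened weight update before sharing it with the server.
local_update = 0.05 * np.random.randn(1000)   # stand-in for a real model delta
private_update = gaussian_mechanism(local_update, clip_norm=1.0,
                                    epsilon=0.5,   # smaller epsilon, stronger privacy
                                    delta=1e-5)
```

In a federated setting, each client would apply such a mechanism to its update before it is sent for federated averaging, so that only a privatised quantity ever leaves the node.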
Our proposed theoretical framework extends the privacy by design framework [11] to federated machine learning design by embedding differential privacy into the system design from end to end. Figure 2 provides a visual description of our proposed differential privacy by design (dPbD) framework. This framework is partially inspired by empirical studies [23, 24, 28, 25, 27] on the use of differential privacy for machine learning. From these empirical studies, two major dimensions consistently emerged. One dimension differentiates between privacy and utility and highlights their trade-off; utility in machine learning can be taken as model accuracy or a decrease in testing loss. The other dimension covers design scalability and robustness. Here, scalability means the scalability of the federated learning model and of the number of clients, while robustness indicates system performance against attacks. There are four quadrants, and thus four important junctions, of the proposed framework that define the most important factors in differential privacy for federated learning design. These factors are the level of privacy, the level of randomised noise, the global sensitivity, and the number of clients or nodes in a differentially private federated learning system. Differential privacy guarantees anonymity and indistinguishability in terms of a privacy budget ($\epsilon$): the smaller the budget, the stronger the confidence in privacy, and we call this the level of privacy. It is a significant characteristic of differential privacy by design, as it provides a quantitative notion of privacy compared to the unclear concept of privacy in the privacy by design framework. The level of privacy defines system robustness against attacks. Similarly, adding more noise as part of differential privacy can reduce system utility and robustness; the amount of randomised noise is thus an important factor in balancing the trade-off between privacy and utility. On the other hand, scalability requires an increase in the number of nodes and effective robust aggregation. Robust aggregation depends on the global sensitivity: a small value of global sensitivity requires that all clients use sufficient local datasets for training, and it also depends on the type of aggregation function. Functions with low global sensitivity, such as the sample mean and the covariance matrix, are preferred for better performance and utility. We propose the following seven principles to realise the proposed differential privacy by design framework in system design.
• Privacy can be guaranteed in design: The system must ensure a meaningful privacy guarantee. For instance, choosing a smaller epsilon produces noisier results and better privacy guarantees in differential privacy. A privacy guarantee during system design builds trust.
• Privacy can be quantified in design: Differential privacy can be used to quantify privacy. The strategy of accounting for budgets, expenses and losses in terms of privacy is known as privacy accounting; the maximum privacy loss is called the privacy budget. This quantification leads to better privacy-preserving design.
• Privacy by Modularity: Modularity is key to ensuring privacy, as it reduces complexity [31]. Modules can be removed, replaced, or upgraded without affecting other components; the privacy of the system should not be affected by removing, replacing, or upgrading any system component.
• Privacy by Robust Aggregation: The aggregation function should have low global sensitivity. The aggregated model should also be insensitive to local data at different nodes.
• Privacy with Scalability: The privacy notion is unaffected by scaling the system. Any change in the scale of the federated learning system, or in asynchronous or synchronous training algorithms, should not degrade privacy.
• Anonymity in the Design Lifecycle: Original data is never shared; only modified functions or model parameters are transmitted. Data anonymity must be ensured throughout the complete design lifecycle.
• Optimising the Privacy-Utility Trade-off in Design: We use utility to refer to certain system properties [32]. An increase in privacy often causes a decrease in utility. A design should optimise this trade-off by ensuring privacy without adversely harming system utility.
The privacy of COVID-19 imaging data on a collaborative machine learning network (federated machine learning) can be taken as a case study of the differential privacy by design framework. Its seven foundational principles can be applied to ensure privacy while designing collaborative machine learning systems. Here, we give a brief view of the implementation of these seven principles in the scenario of training a machine learning model on a federated learning system (a simplified code sketch of such a training loop is given at the end of this section).
• A privacy-preserving federated learning design must control the level of privacy to control perturbation and noise for robustness. However, the design must provide an adequate level of privacy to protect patients' personal information embedded in or related to their COVID-19 imaging data.
• The privacy level must be quantified in the federated design. There must be understanding and consensus among the teams involved in the federated learning setup about the minimum and maximum levels of privacy required for collaborative research.
• Adequate privacy must be enforced during design across all system components, such as clients, the federated server and communication channels. The addition or removal of clients and changes in the client-side learning model should not affect the data privacy promised by the system.
• A robust model aggregation strategy must be adopted. A function with low global sensitivity should be chosen when designing any federated system for COVID-19. The robust aggregation should be insensitive to local changes and to changes in data on client nodes.
• COVID-19 imaging data is never shared throughout the training lifecycle. Model sharing must be protected and must maintain the anonymity of patient information throughout model updates and communication rounds.
• The system design must be flexible: adding or removing clients and teams, or increasing or decreasing the COVID-19 data size, should not adversely harm patient data or shared model privacy. Horizontal and vertical scalability should be supported during design, with features of robustness and resilience.
• Accuracy is an important consideration in COVID-19 diagnosis and should not be harmed. Reducing or sacrificing it for the sake of privacy can adversely affect system utility; an optimised trade-off should be sought during the design.
The proposed differential privacy by design framework is focused on the design of privacy-preserving federated machine learning systems. This theoretical framework is developed by drawing inspiration from various empirical studies on the use of differential privacy in federated learning. However, we found that while the majority of the proposed systems emphasise the trade-off between privacy and utility, they often ignore the scalability and robustness of the system. Our proposed framework fills that gap and provides a comprehensive framework that covers the majority of the design concepts.
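As an illustration of the principles above, the following minimal sketch (Python with NumPy; names such as local_training_step and federated_round are hypothetical, and the client-side training is only a placeholder rather than a real CXR classifier) shows one round of differentially private federated averaging in which raw imaging data never leaves a client and only clipped, noised model deltas are aggregated.

```python
import numpy as np

def local_training_step(global_weights, local_data, lr=0.1):
    """Placeholder for client-side training on local COVID-19 imaging data.
    In a real system this would run several epochs of, e.g., a CNN on CXR
    images; here a random perturbation stands in for the learned update."""
    return global_weights - lr * 0.01 * np.random.randn(*global_weights.shape)

def privatise(update, clip_norm=1.0, epsilon=1.0, delta_dp=1e-5):
    """Clip the weight delta and add Gaussian noise before it is uploaded,
    so only a differentially private quantity leaves the client."""
    update = update * min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta_dp)) / epsilon
    return update + np.random.normal(0.0, sigma, size=update.shape)

def federated_round(global_weights, client_datasets, epsilon, delta_dp):
    """One communication round: each client trains locally, uploads a
    privatised delta, and the server averages the deltas (federated
    averaging) into the shared global model."""
    deltas = []
    for local_data in client_datasets:
        local_weights = local_training_step(global_weights, local_data)
        deltas.append(privatise(local_weights - global_weights,
                                epsilon=epsilon, delta_dp=delta_dp))
    return global_weights + np.mean(deltas, axis=0)

# Toy run: five hospitals with synthetic placeholders for local imaging data.
client_datasets = [np.zeros((8, 64, 64)) for _ in range(5)]
weights = np.zeros(128)
for _ in range(10):  # ten communication rounds
    weights = federated_round(weights, client_datasets, epsilon=1.0, delta_dp=1e-5)
```

In practice, the placeholder training step would be replaced by actual model training on each site's CXR or CT data, and a privacy accountant would track the cumulative privacy budget across communication rounds.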
We argue that, following in the footsteps of the privacy by design framework, differential privacy must be embedded throughout the design lifecycle, and it should provide overall coverage of protection and privacy to any proposed federated machine learning system. To pave the way for this embedding in the design lifecycle, we defined seven foundational principles, similar to the seven principles of the privacy by design framework. We use a case study of COVID-19 imaging data; however, our proposed framework should be applicable to all types of data on a federated learning system. In our future work, we will implement and validate these principles for designing a computer vision-based COVID-19 diagnosis system based on pathological imaging data available from various teams around the world. We hope that the proposed framework will support the design of privacy-preserving federated learning systems with reduced complexity and sufficient data protection for the collaborative research needed to combat the COVID-19 challenge.

References:
Digital tools against COVID-19: taxonomy, ethical challenges, and navigation aid. The Lancet Digital Health.
On the responsible use of digital data to tackle the COVID-19 pandemic.
COVID-19 and health care's digital revolution.
Artificial intelligence vs COVID-19: limitations, constraints and pitfalls.
Computer vision for COVID-19 control: a survey.
X-ray image based COVID-19 detection using pre-trained deep learning models.
Anwaar Ulhaq, Biswajeet Pradhan, Manas Saha, and Nagesh Shukla. COVID-19 detection through transfer learning using multimodal imaging data.
COVID-19 control by computer vision approaches: a survey.
Potential features of ICU admission in x-ray images of COVID-19 patients.
The EU General Data Protection Regulation (GDPR): a practical guide.
Privacy by design: the 7 foundational principles. Information and Privacy Commissioner of Ontario.
Regulating privacy by design. Berkeley Tech.
Federated optimization: distributed optimization beyond the datacenter.
Federated optimization: distributed machine learning for on-device intelligence.
Federated machine learning: concept and applications.
Analyzing federated learning through an adversarial lens.
Federated learning: challenges, methods, and future directions.
Differential privacy: a survey of results.
Differential privacy and machine learning: a survey and review.
Privacy by design.
SecureML: a system for scalable privacy-preserving machine learning.
An optimized image fusion algorithm for night-time surveillance and navigation.
Deep learning with differential privacy.
Federated learning with differential privacy: algorithms and performance analysis.
Blockchain-federated-learning and deep learning models for COVID-19 detection using CT imaging.
Experiments of federated learning for COVID-19 chest X-ray images.
Differentially private federated learning: a client level perspective.
Robust aggregation for federated learning.
Federated learning with Bayesian differential privacy.
Modularity and design in reactive intelligence.
Privacy-utility tradeoff under statistical uncertainty.