key: cord-0159443-d53urhqx authors: Priyanshu, Aman; Naidu, Rakshit title: FedPandemic: A Cross-Device Federated Learning Approach Towards Elementary Prognosis of Diseases During a Pandemic date: 2021-04-05 journal: nan DOI: nan sha: 4ee45e4100f949541c32b10135ff53dd06d2febf doc_id: 159443 cord_uid: d53urhqx

The amount of data, manpower and capital required to understand, evaluate and agree on a group of symptoms for the elementary prognosis of pandemic diseases is enormous. In this paper, we present FedPandemic, a novel noise implementation algorithm integrated with cross-device Federated learning for elementary symptom prognosis during a pandemic, taking COVID-19 as a case study. Our results display consistency and enhanced robustness in recovering the common symptoms displayed by the disease, paving a faster and cheaper path towards symptom retrieval while also preserving the privacy of patients' symptoms via Federated learning.

Symptom prognosis and analysis are important tools of pandemic management, as the medical condition of the population can be gauged with them. However, during COVID-19 the appropriate symptoms and their exact effects were reported only after mass collection and analysis (Ghosh et al. (2020), Bennett & Carney (2011)). This not only consumed time but also required an immense amount of manual effort to anonymize the continuously growing corpus of client data.

In this paper, we propose FedPandemic, a novel approach towards the elementary prognosis of diseases during a pandemic by cross-device Federated learning. We present a novel tool for prominent symptom detection that retains client privacy during an outbreak. This encourages collaborative efforts between the general public, smaller healthcare clinics/facilities, Non-Governmental Organizations (NGOs), hospitals and large network medical institutions.

Federated learning (McMahan et al. (2016), Bonawitz et al.
(2019)) enables one to send models to where the data resides, rather than sending the data to the cloud, thereby respecting the privacy of the users. Federated learning empowers distributed learning by gaining generalized insights over the active client space on decentralized data over a large number of rounds. FedPandemic employs word embeddings as feature extractors for a binary classification model, which is trained using the Federated Averaging (FedAvg) algorithm (McMahan et al. (2016)). The classifier is intended to contribute towards preliminary medical examinations and prominent symptom retrieval in the early stages of an outbreak. The model is designed to be modifiable, so that Secure Aggregation (Bonawitz et al. (2017)) or Differential Privacy (Wei et al. (2020)) can be added for additional privacy use-cases. FedPandemic is trained on the symptom statistics reported in Statista's collection of COVID-19 symptoms in Kenya (Faria (2021)), Germany (Koptyug (2021)), Italy (Stewart (2020)), United States (Elflein (2020)) and China (Thomala (2021)). The model employs and simulates different target clients with variable data sizes for learning. The implementation requires little computational power while still retaining high performance and client privacy, making FedPandemic a potentially strong tool for future symptom detection during an outbreak.

We summarize five major problems presented in current symptom prognosis tools: (1) Time Consumption in centralized aggregation by a single institution. (2) Data Security of clients participating

Prominent symptom detection is an integral part of pandemic management and control. If these symptoms are detected and retrieved early, elementary prognosis can proceed faster. This may allow different governments to prevent the spread of such diseases.
However, current technologies require a large network of people maintaining and analyzing this data, which is quite expensive. With FedPandemic, we hope to overcome this problem by using Federated learning to provide client privacy and learning with low maintenance costs.

We employ Federated learning in a cross-device system, as this enables the general public to contribute individually. The Federated Averaging algorithm (McMahan et al. (2016)) is used for generalizing the aggregated model. We utilize word embeddings for feature extraction on local devices, which allows us to use state-of-the-art yet computationally efficient encoders such as GloVe (Pennington et al. (2014)) and Word2Vec embeddings. Word embeddings produce a fixed-length vector as extracted features. This output is then fed into a client model, which is trained for a number of epochs E, and the weights of the updated model w_i are then returned to a centralised server. In this paper, we run multiple simulations on different contributors using GloVe (refer Table 1).

A common word encoder is decided for implementation, and a lightweight classifier is designed with the selected embedder in mind. This allows us to develop a model while keeping computational costs low. The selected embedder (here, GloVe) and model architecture are declared for training and aggregation. However, training only on client symptoms would bias the classifier. Hence, we randomly sample symptoms from a given medical corpus, which are then learnt as negative samples by the models (refer Figure 1). The proposed methodology allows us to keep a pseudo data balance, thereby making our models robust to bias and underfitting. We believe that we propose the first implementation for symptom aggregation on a large-scale application that supports both client privacy and distributed learning.
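As a minimal sketch of the aggregation step described above, the server-side part of Federated Averaging can be written as a sample-size-weighted mean of the client weights w_i. The function name `fed_avg`, the flat-list representation of the weights and the toy numbers are illustrative assumptions and are not taken from the original implementation, which trains PyTorch models.

```python
from typing import List, Sequence


def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Return the sample-size-weighted average of client weight vectors.

    Hypothetical helper: weights are flat lists of floats here, whereas a
    real client model would hold one tensor per layer.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, n_k in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share one architecture")
        # Each client contributes in proportion to its local data size.
        for j, w in enumerate(weights):
            averaged[j] += (n_k / total) * w
    return averaged


# Toy round: three clients holding different amounts of local data.
global_weights = fed_avg(
    client_weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    client_sizes=[2, 2, 4],
)
```

Weighting by the number of local samples means that a client with more data pulls the global model further towards its own update, which matters when client sizes vary as widely as they do across the simulations in this paper.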
The procedure allows us to overcome some important issues of symptom analysis: (1) Manual aggregation of data from multiple healthcare centres is not required. (2) Common symptoms that would be easily identified by the public, such as high temperature, fever, cough and cold, would also be treated with prominence, giving the general public a better chance of discerning the infected. (3) Client privacy is retained, avoiding the effort required for data anonymization. (4) Word embeddings also allow semantically similar symptoms to be treated with prominence. This may aid researchers in studying additional symptoms that the affected might be exhibiting.

The experiments were conducted on a single system, running multiple instances of client models. The system had 8GB of RAM and a GeForce GTX 1650 GPU with 4GB of memory. We leverage the PyTorch framework for our experiments, and the base algorithm used for Federated learning was FedAvg (McMahan et al. (2016)). The classifier used in our experiments consisted of (50, 32, 16, 8, 1) neurons from the top layer to the bottom layer. For our experiments, the learning rate and batch size were chosen as 0.001 and 32 respectively, along with the Adam optimizer. Our approach involved four simulations represented by different aggregation steps, which can be employed by local authorities. Our presentation takes statistical numbers from data as published on Statista. We choose GloVe (Pennington et al. (2014)) as our encoder because its embeddings are lightweight and easy to use in a Federated learning environment compared to embeddings from other state-of-the-art encoders like BERT (Devlin et al. (2018)) and ELMo (Peters et al. (2018)).

In this work, we present four variants of simulations (see Table 1):
• Simulation I: This simulation aims towards reproducing aggregation by large medical institutes. In this simulation, we distribute the entire corpus across 20 institutions or clients and train a federated model.
Each institution has been given an equal number of sample cases (60,000 samples). This simulation is definite, as large medical institutes already have enough data to determine which symptoms are prominent.
• Simulation II: Medium Ranged Medical Institutes, like Hospitals, NGOs, etc. This simulation offers to cluster and pick symptoms from a larger collaborating group. However, even this group is large enough to accurately classify prominent symptoms. In this case, the data is not equally distributed and ranges between 10,000 and 20,000 samples.
• Simulation III: Small Ranged Medical Institutes, like clinics and health care centres. This simulation is the most practical one, as these institutes may be able to actively collaborate for training such a model. Each client will have between 500 and 2,000 samples.
• Simulation IV: Individual/Family Contributions. This simulation is the toughest to learn and provides the most realistic sample which could be implemented for the preliminary search of symptoms. Each client contains between 2 and 12 samples.
These simulations are pulled from the given distribution (refer Figure 5) and aim to replicate real-world usage of FedPandemic.

We provide experimental results on a few prominent symptoms with different noise levels against the prediction output (refer Figure 2). We experiment with the random sampling step (refer Figure 1) using Normal and Laplacian distribution values. The Laplace mechanism (with a parameter of 1/ε) preserves ε-Differential Privacy (Dwork & Roth (2014)). We vary the value of ε (taking 50% as the standard noise level across all the simulations) to observe how our algorithm plays into Differential Privacy guarantees (as shown in Figure 3). We also display experiment results for different noise levels for target symptoms shown by more than and less than 10% of the Survey Population (refer Figure A.2).
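To illustrate how ε controls the amount of noise in the Laplacian variant, the following sketch draws Laplace noise with scale 1/ε using only the Python standard library. The helper names `laplace_noise` and `mean_abs_noise`, the fixed seed and the sample count are assumptions made for illustration; how the noise enters the random sampling step of Figure 1 is not reproduced here.

```python
import random
from statistics import mean


def laplace_noise(epsilon: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution with scale 1/epsilon.

    The difference of two independent exponential variables with rate
    epsilon follows a zero-mean Laplace distribution with scale 1/epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)


def mean_abs_noise(epsilon: float, n: int = 20000, seed: int = 0) -> float:
    """Empirical mean of |noise|; its expected value is 1/epsilon."""
    rng = random.Random(seed)
    return mean(abs(laplace_noise(epsilon, rng)) for _ in range(n))


# Larger epsilon means a smaller scale and therefore less noise on average.
noise_by_epsilon = {eps: mean_abs_noise(eps) for eps in (0.1, 1.0, 10.0)}
```

Under these assumptions, a small ε yields large perturbations (stronger privacy, lower utility) and a large ε yields small perturbations, which is the trade-off examined when ε is varied.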
In this paper, we showcase a novel approach using Federated learning towards elementary symptom prognosis in order to preserve client privacy and enable faster response times during a pandemic. Our experiments include various noise levels, and the accuracy drops consistently as the noise values are increased (refer Table 3). Simulation IV displays the highest output predictions, as we evaluate over a large client space, which signifies more personalized models (see Figure 2). We see that the Laplacian variant of our algorithm provides ε-DP (as ε increases, a lower amount of noise is added, which intuitively means higher utility and higher accuracy) in Figure 3. We believe that either Simulation II or III could make the best use of FedPandemic (given their size range and number of clients) with the 50% noise level readings, as these best replicate real-world situations. We hope to improve our method by making it robust to malicious attacks and Byzantine failures. We also wish to improve the training model by incorporating data from other countries.

Figure ?? to display results with extreme noise levels):
1. 10: Almost perfect simulation, with a minimal amount of noise. The experiment can be thought of as a setting where only individuals infected with the novel coronavirus, or the pandemic disease in consideration, participate. There is little to no influence of any other symptoms the clients may have been facing during learning.
2. 25: Close to ideal simulation. Here the noise levels are increased to 25%. That is, an individual may report symptoms other than those of the disease in consideration with a 25% probability. This setting is closer to reality, as the general public would not know whether the symptoms they feel are relevant to the pandemic or not.
3. 50: Here noise levels have been set to 50%. That is, an individual may report symptoms other than those of the disease in consideration with a 50% probability.
A step closer to reality and probably the closest, as most citizens are still healthy or, if suffering, would recognize new symptoms easily. However, noise would still be generated due to the large sample space.
4. 75: Noise levels are set to 75%. In this case the citizens have a higher chance of entering symptoms that are not related to the pandemic in question; however, due to the frequency presented in the total population, insignificant or unassociated symptoms would be lost.
5. 90: If a person does not put a symptom related to the virus, they will put another arbitrary symptom. Using the 90% noise level, we show that this is entirely based on the frequency analysis shown during a pandemic. The only reason our federated model is expected to converge and learn in such a case is the sheer frequency of pandemic victims. Therefore, we specifically target our project towards pandemics like the novel coronavirus.

Algorithm 1 Noise Implementation
1: Input:
2: Survey Population ← People using the application;
3: Prominent COVID19 symptoms ← [cs_1, cs_2, ..., cs_n]; the actual symptoms displayed as per the research conducted by Statista. Ex: [Fever, Cough, Headache]
4: Symptom Probability ← [ps_1, ps_2, ..., ps_n]; probability of prominent symptoms that was displayed by the research conducted by Statista. Ex: [0.37, 0.2, 0.1]
5: Medical Corpus ← [s_1, s_2, ..., s_m, cs_1, cs_2, ..., cs_n]; medical corpus present in GloVe which the public may identify. Ex: [Stomach Ache, Anemia, Red Eyes, ..., Fever, Cough, Headache, ..., ulcers, etc.]
6: random.random() ← random number sampled from the normal distribution N(0,1).
Algorithm:
7: for i ∈ Survey Population do

For the work presented in this paper, we employ simulations for data collection and the cross-device setup. As there is no similar objective or dataset, we have taken the liberty of implementing these simulations based on real-world data recordings.
We take the data collected by Statista for COVID-19 symptoms in different countries. The data presented is a statistical representation of the percentage of people exhibiting a certain symptom. The entire corpus comprises data from 1,226,465 people. The countries whose data has been used for simulation are:
• Kenya
• Germany
• Italy
• United States
We present the statistics of each of these countries in Table 1, and the aggregate statistics we pulled in Table 3. The data presented in Table 3 has been visualized (as a bar graph) in Figure 5.

Table 4: These are the 4 most prominent symptoms as well as the total number of participants from every country included in our dataset. In retrospect, these values represent the general display of such symptoms, as the total number of participants makes up 1.2 million people. The perturbations are generated from the same distribution in order to approach realistic sample spaces.

Review paper: Pandemic preparedness in Asia: A role for law and ethics
Practical secure aggregation for privacy-preserving machine learning
Towards federated learning at scale: System design
BERT: Pre-training of deep bidirectional transformers for language understanding
The algorithmic foundations of differential privacy
Percentage of people with COVID-19 in the United States from January 22 to May 30, 2020 who had select symptoms
Knowledge of coronavirus (COVID-19) symptoms among the Kenyan population from
How India is dealing with COVID-19 pandemic
Most frequent symptoms caused by the coronavirus (COVID-19) in Germany in 2021
Federated learning of deep networks using model averaging
GloVe: Global vectors for word representation