key: cord-0973139-sk6oac3o
authors: Pourhomayoun, Mohammad; Shakibi, Mahdi
title: Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making
date: 2021-01-16
journal: Smart Health (Amst)
DOI: 10.1016/j.smhl.2020.100178
sha: e8047ea5ca4d973c3f41fa95b1ea6c321299420c
doc_id: 973139
cord_uid: sk6oac3o

In the wake of COVID-19 disease, caused by the SARS-CoV-2 virus, we designed and developed a predictive model based on Artificial Intelligence (AI) and Machine Learning algorithms to determine the health risk and predict the mortality risk of patients with COVID-19. In this study, we used a dataset of more than 2,670,000 laboratory-confirmed COVID-19 patients from 146 countries around the world including 307,382 labeled samples. This study proposes an AI model to help hospitals and medical facilities decide who needs to get attention first, who has higher priority to be hospitalized, triage patients when the system is overwhelmed by overcrowding, and eliminate delays in providing the necessary care. The results demonstrate 89.98% overall accuracy in predicting the mortality rate. We used several machine learning algorithms including Support Vector Machine (SVM), Artificial Neural Networks, Random Forest, Decision Tree, Logistic Regression, and K-Nearest Neighbor (KNN) to predict the mortality rate in patients with COVID-19. In this study, the most alarming symptoms and features were also identified. Finally, we used a separate dataset of COVID-19 patients to evaluate our developed model accuracy, and used confusion matrix to make an in-depth analysis of our classifiers and calculate the sensitivity and specificity of our model.

In late 2019, a novel form of Coronavirus, named SARS-CoV-2 (stands for Severe Acute Respiratory Syndrome Coronavirus 2), started spreading in the province of Hubei in China, and claimed numerous human lives [1]-[3].
In January 2020, the World Health Organization (WHO) declared the novel coronavirus outbreak a Public Health Emergency of International Concern (PHEIC) [4] [5]. In February 2020, WHO selected an official name, COVID-19 (short for Coronavirus Disease 2019), for the infectious disease caused by the novel coronavirus, and in March 2020 declared COVID-19 a pandemic [5] [6]. Coronaviruses are a family of viruses that usually cause respiratory tract disease and infections that can be fatal in some cases, as in SARS, MERS, and COVID-19. Some kinds of coronavirus affect animals, and on rare occasions a coronavirus jumps from an animal species into the human population. The novel coronavirus might have jumped from an animal species into the human population and then begun spreading [7]. A recent study has shown that once a coronavirus outbreak starts, it takes less than four weeks to overwhelm the healthcare system. Once hospital capacity is overwhelmed, the death rate jumps [8].

Artificial Intelligence (AI) has been shown to be an effective tool for predicting medical conditions and adverse events and for helping caregivers with medical decision-making [9]-[13]. In this study, we propose a data-driven predictive analytics algorithm based on AI and machine learning to determine the health risk and predict the mortality risk of patients with COVID-19. The developed system can help hospitals and medical facilities decide who needs to get attention first, who has higher priority to be hospitalized, triage patients when the system is overwhelmed by overcrowding, and eliminate delays in providing the necessary care. The algorithm predicts the mortality risk based on patients' physiological conditions, symptoms, and demographic information.
The proposed system includes a set of algorithms for preprocessing the data to extract new features, handle missing values, eliminate redundant and useless data elements, and select the most informative features. After preprocessing the data, we use machine learning algorithms to develop a predictive model that classifies the data, predicts the medical condition, and calculates the probability and risk of mortality. The processed dataset and code have been released in the hope of benefiting the research community (https://github.com/mshakib/COVID-19.git).

The rest of this paper is organized as follows. Section 2 introduces the methods and model architecture and discusses each method in detail, covering the model, data preprocessing, the challenges we encountered and the steps taken to mitigate them, feature selection, and feature extraction. Section 3 provides the results with various approaches and metrics. Sections 4 and 5 present the discussion and conclusion.

In this paper, we used a dataset of more than 2,670,000 laboratory-confirmed COVID-19 patients from 146 countries around the world [3], including 307,382 labeled samples containing both male and female patients with an average age of 44.75 [3]. The disease was confirmed by detection of virus nucleic acid [3]. The original dataset contained 32 data elements for each patient, including demographic and physiological data. At the data cleaning stage, we removed useless and redundant data elements such as data source, admin id, and admin name. We also removed the unlabeled data samples. Then, data imputation techniques, including mean/median/mode value replacement and the KNN technique, were used to handle missing values. To obtain an accurate and unbiased model, we made sure that our dataset was balanced: a balanced dataset with an equal number of observations for recovered and deceased patients was created to train and test our model.
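The imputation and balancing steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' released code: the use of scikit-learn, the helper names `impute_missing` and `balance_by_undersampling`, the toy features, and the choice of random undersampling to reach equal class sizes are all assumptions, since the text does not state how the balanced dataset was constructed.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer


def impute_missing(X, strategy="median", n_neighbors=5):
    """Fill NaNs with a column statistic or with a KNN-based estimate."""
    if strategy == "knn":
        imputer = KNNImputer(n_neighbors=n_neighbors)
    else:
        # strategy is "mean", "median", or "most_frequent" (the mode)
        imputer = SimpleImputer(strategy=strategy)
    return imputer.fit_transform(X)


def balance_by_undersampling(X, y, seed=0):
    """Randomly drop rows of the larger class until both classes match in size.

    Assumption: the paper only says the dataset was balanced, not how.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Toy data: three hypothetical numeric features and an imbalanced outcome label
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan       # remove roughly 10% of the values
y = (rng.random(200) < 0.25).astype(int)    # 1 = deceased, 0 = recovered

X_filled = impute_missing(X, strategy="knn")
X_balanced, y_balanced = balance_by_undersampling(X_filled, y)
```

In practice the imputer would be fitted on the training portion only and then applied to the test portion, so that no information from test patients leaks into the model.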
The data samples (patients) in the training dataset were selected randomly and are completely separate from the testing data. Figure 1 shows the high-level architecture of our system.

The outcome label contained multiple values describing the patient's health status. We considered patients who were discharged from hospital, or who were in a stable condition with no remaining symptoms, as recovered patients. The symptoms were recorded by healthcare officials at the time of admission to the hospital. A total of 112 features were extracted from the original dataset, including symptoms, doctors' medical notes, demographics, and physiological information. We consulted with a medical team to make sure that all of the relevant features were extracted.

The next step is feature selection. The primary purpose of feature selection is to find the most informative features and eliminate redundant data to reduce the dimensionality and complexity of the model [11]. We used univariate and multivariate filter methods and a wrapper method to rank the features and select the best feature subset [11]. Figure 2 shows the steps of the filter and wrapper methods that we used for feature selection. Filter methods are very popular (especially for large datasets) since they are usually very fast and much less computationally intensive than wrapper methods. Filter methods use a specific metric to score each individual feature (or a subset of features together). The most popular metrics used in filter methods include the correlation coefficient, Fisher score, mutual information, entropy, consistency, and chi-square statistics [11].

After selecting the best feature subset, we used various machine learning algorithms to build a predictive model: Support Vector Machine (SVM), Neural Networks, Random Forest, Decision Tree, Logistic Regression, and K-Nearest Neighbor (KNN) [15] [16] [17].
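A two-stage filter-then-wrapper selection of the kind described above might look like the sketch below. It is illustrative only: the synthetic data, the choice of mutual information as the filter metric, recursive feature elimination around logistic regression as the wrapper, and the subset sizes (40 and 15) are assumptions, not values reported in the paper.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with 112 features, matching the count extracted in the paper
X, y = make_classification(n_samples=500, n_features=112, n_informative=10,
                           n_redundant=5, random_state=0)

# Filter step: score every feature independently and keep the 40 best.
# The score function is seeded so the ranking is reproducible.
score_func = partial(mutual_info_classif, random_state=0)
filter_step = SelectKBest(score_func=score_func, k=40).fit(X, y)
filter_idx = np.flatnonzero(filter_step.get_support())

# Wrapper step: repeatedly refit a classifier and drop the weakest features
# (5 per round) until 15 remain.
wrapper_step = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=15, step=5)
wrapper_step.fit(X[:, filter_idx], y)

# Map the wrapper's choice back to column indices of the original matrix
selected = filter_idx[wrapper_step.get_support()]
print("selected feature indices:", selected)
```

Running the cheap filter first keeps the expensive wrapper search small, which is the usual reason for combining the two.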
The Neural Network algorithm achieved the best performance and accuracy. We used grid search to find the best hyperparameters for the neural network, searching over the number of layers and the number of neurons in each layer (in the range of 3 to 50), the activation function ('relu', 'logistic'), the regularization rate, and the batch size. The best neural network results were achieved with two hidden layers, with 40 neurons in the first layer and 3 neurons in the second. We used a stochastic gradient descent optimizer, a constant learning rate, and a regularization rate of alpha = 0.01. The SVM model was configured with a linear kernel and regularization parameter C = 1.0. The Random Forest algorithm is an ensemble learning method composed of multiple decision tree predictors that are trained on random data samples and feature subsets [17]. We configured the random forest with 20 trees.

We used 10-fold random cross-validation (with no overlap and no replacement) to evaluate the developed model. We calculated the overall accuracy of every machine learning algorithm for comparison. We also generated Receiver Operating Characteristic (ROC) curves for every algorithm and calculated the Area Under the Curve (AUC) and the confusion matrix. Again, we made sure that there was no overlap (no common patient) between the training and testing datasets at any level. We also repeated the feature selection during cross-validation, using only the training data, to confirm the results; the selected features matched the original feature selection. The next section provides the results and performance of the developed system.

As explained in Section 2, several metrics, including accuracy, ROC, AUC, and the confusion matrix, were used to evaluate the developed model.
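The reported configurations and the evaluation protocol can be sketched as follows. The paper does not name its software library. The hyperparameter names it uses ('relu', 'logistic', alpha) resemble those of scikit-learn's `MLPClassifier`, so scikit-learn is assumed here. The synthetic data, the feature standardization step, the 0.5 decision threshold, and the `max_iter` value are also assumptions, and the grid search itself is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the balanced, feature-selected patient data
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)

models = {
    # Two hidden layers (40 and 3 neurons), SGD, constant learning rate, alpha = 0.01
    "neural_network": MLPClassifier(hidden_layer_sizes=(40, 3), solver="sgd",
                                    learning_rate="constant", alpha=0.01,
                                    max_iter=1000, random_state=0),
    # Linear kernel with C = 1.0; probability=True enables predict_proba
    "svm": SVC(kernel="linear", C=1.0, probability=True, random_state=0),
    # Forest of 20 trees
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=0),
}

# 10 shuffled folds; every sample is used for testing exactly once
cv = KFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)  # scaler is refit inside each fold
    proba = cross_val_predict(pipe, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    results[name] = {
        "accuracy": accuracy_score(y, pred),
        "auc": roc_auc_score(y, proba),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Because the scaler sits inside the pipeline, each fold's preprocessing is fitted on that fold's training part only, which mirrors the paper's requirement of no overlap between training and testing data.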
[Table 2]

The results demonstrate that the developed algorithm is able to accurately predict the mortality risk in patients with COVID-19 based on the patients' physiological conditions, symptoms, and demographic information. Figure 6 shows the mortality risk (the probability of death) predicted by the algorithm for sample patients.

In this study, we processed a large dataset of confirmed COVID-19 cases collected from all around the world and used state-of-the-art machine learning algorithms to predict the mortality rate for patients with COVID-19. We evaluated the developed algorithms using several different metrics. The evaluation results demonstrate the high accuracy and effectiveness of the developed models. Other studies have shown promising results for predicting the mortality rate in COVID-19 patients using blood lab results and clinical data [18]. In our study, however, we focused on demographic information, physiological data, patients' symptoms, and pre-existing conditions. We reached an accuracy of 89.98% using the neural network model. Furthermore, whereas previous studies mostly focused on data collected from China [18] [19], we used hospital data from all around the world to create a more comprehensive model that is applicable to the world population and is not trained only on data from one particular region.

The purpose of this study was to create a predictive algorithm to help hospitals and medical facilities maximize the number of survivors by providing an accurate and reliable tool to support medical decision-making and to triage COVID-19 patients more effectively during the pandemic. Our algorithm is able to predict the mortality risk in patients with COVID-19 with high accuracy using the patients' physiological conditions, symptoms, pre-existing conditions, and demographic information.
This system can help hospitals, medical facilities, and caregivers decide who needs to get attention first, triage patients when the system is overwhelmed by overcrowding, and eliminate delays in providing the necessary care. This study could be extended to other diseases to help the healthcare system respond more effectively during an outbreak or a pandemic.

Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia
Epidemiological data from the COVID-19 outbreak, real-time case information
Pneumonia of unknown aetiology in Wuhan, China: potential for international spread via commercial air travel
Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV), World Health Organization (WHO)
WHO Director-General's opening remarks at the media briefing on COVID-19
When does Hospital Capacity Get Overwhelmed in USA? Germany? A model of beds needed and available for Coronavirus patients
Interactive Dimensionality Reduction for Improving Patient Adherence in Remote Health Monitoring
Risk Prediction of Critical Vital Signs for ICU Patients Using Recurrent Neural Network
Multiple model analytics for adverse event prediction in remote health monitoring systems
Interactive Predictive Analytics for Enhancing Patient Adherence in Remote Health Monitoring
Multi-label Classification of Single and Clustered Cervical Cells Using Deep Convolutional Networks
Context-Aware Data Analytics for Activity Recognition
Machine Learning
The Nature of Statistical Learning Theory
Random Forests
Hypertension: from basic research to clinical practice
Diabetes and Hypertension: Is There a Common Metabolic Pathway?
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.