key: cord-0899064-fsyylmo0
authors: Varzaneh, Zahra Asghari; Orooji, Azam; Erfannia, Leila; Shanbehzadeh, Mostafa
title: A new COVID-19 intubation prediction strategy using an intelligent feature selection and K-NN method
date: 2021-12-28
journal: Inform Med Unlocked
DOI: 10.1016/j.imu.2021.100825
sha: ad9df522d95cc141ad05fa2977755d6666d4c90d
doc_id: 899064
cord_uid: fsyylmo0

BACKGROUND: Predicting severe respiratory failure due to COVID-19 can help triage patients to higher levels of care, guide resource allocation, and decrease morbidity and mortality. The need for this research derives from the increasing demand for innovative technologies to handle complex data analysis and decision-making tasks in critical care units. Hence, the aim of our paper is to present a new algorithm for selecting the best features from the dataset and to develop Machine Learning (ML) based models that predict the intubation risk of hospitalized COVID-19 patients. METHODS: In this retrospective single-center study, the data of 1225 COVID-19 patients from February 9, 2020, to July 20, 2021, were analyzed by several ML algorithms, including Decision Tree (DT), Support Vector Machine (SVM), Multilayer Perceptron (MLP), and K-Nearest Neighbors (K-NN). First, the most important predictors were identified using the Horse herd Optimization Algorithm (HOA). Then, the performance of the ML algorithms was compared on several evaluation criteria to identify the best-performing one. RESULTS: Predictive models were trained using 12 validated features. The proposed DT-based predictive model achieved a reasonable level of accuracy (93%) in predicting the risk of intubation among hospitalized COVID-19 patients. CONCLUSIONS: The experimental results demonstrate the effectiveness of the proposed meta-heuristic feature selection technique, in combination with the DT model, in predicting intubation risk for hospitalized patients with COVID-19.
The proposed model has the potential to provide frontline clinicians with a quantitative and non-invasive tool to assess illness severity and identify high-risk patients.

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has affected millions of people worldwide (1, 2). Approximately 15%-20% of symptomatic patients progress to serious complications such as severe pneumonia, Acute Respiratory Distress Syndrome (ARDS), cytokine storm syndrome, and Multi-system Organ Dysfunction (MOF) requiring Intensive Care Unit (ICU) hospitalization (3, 4). Many hospital systems face extreme challenges with the extraordinary number of critical cases, causing many ICU departments to reach or exceed capacity (5, 6). In response to this serious infection, the design and implementation of predictive models will be essential to the optimal use of limited ICU resources and to support for clinical decisions (7, 8). Physicians have reported problems in predicting the progression of COVID-19 in hospitalized patients, along with problems in identifying patients who are prone to rapid deterioration. This need is accentuated by the unpredictability of the disease's behavior and course (9). COVID-19 patients who deteriorate and need critical support with their breathing require mechanical ventilation (10). Therefore, there is an immediate demand for classifying cases for respiratory intubation services (11). Within the short span of the COVID-19 pandemic, many researchers have become interested in new and non-invasive digital technologies such as Artificial Intelligence (AI), which can be effective in the accurate and timely detection of patients at risk for clinical deterioration and severe hypoxia (low SPO2) (12, 13).
Studies suggest that these methods may facilitate the identification of high-risk patients and the adoption of the most effective supportive oxygen therapy (13-15). Machine Learning (ML), a subfield of AI (16), is an essential tool for clinicians to identify patients at high risk for severe disease and to prioritize their hospitalization and resource utilization. It can therefore help reduce patient mortality and the burden on health care resources (7, 17). In prior studies, a large number of ML-based models were developed for predicting the risk of COVID-19 severity and patient deterioration (18, 19), ICU admission (17, 19-22), mechanical intubation (23), and death (17, 20, 24-29). In this paper, we retrospectively analyzed the data of hospitalized COVID-19 patients from Imam Khomeini hospital, Ilam, Iran. First, multiple meta-heuristic feature selection methods are compared, using the K-NN classification algorithm, to select the best intubation predictors in patients with COVID-19. Then we construct and compare several ML-based models for predicting which COVID-19 patients will require respiratory intubation, based on the selected variables. More precisely, the study questions are: (1) what are the most relevant predictors of intubation in COVID-19 patients, and (2) which prediction model performs best?

In this study, a COVID-19 hospital-based registry database from Imam Khomeini hospital, Ilam city, West of Iran, was retrospectively reviewed from February 9, 2020, to December 20, 2020. During this period, a total of 6854 suspected COVID-19 cases were referred to this center, of whom 1853 were confirmed as COVID-19 positive, 2472 as negative, and 2529 as unspecified.
After applying the exclusion criteria (for example, negative RT-PCR COVID-19 test, unknown disposition, discharge or death from the emergency department, and more than 70% missing data), the records of 1225 patients remained for analysis.

The proposed method in this paper has three main phases. In the first phase, the raw data of patients with COVID-19 are preprocessed so that they can be used in the data mining process. In the second phase, which is our main goal in this study, effective clinical factors for predicting intubation in patients with COVID-19 are identified: with the help of meta-heuristic optimization algorithms, the most important features that achieve higher predictive accuracy are extracted. In the third phase, the data are classified by ML algorithms.

Using raw COVID-19 datasets directly in the data mining process lowers the efficiency of the algorithms and the quality of the experimental results. The useful information that can be extracted from the data directly affects our model's ability to learn, so it is very important to preprocess the data before feeding it into the model. Preprocessing is imperative to address irrelevant, redundant, and unreliable data and to resolve inconsistencies (30). In this regard, several preprocessing methods are used to prepare the data for the data mining process. Removing records with a high missing rate improves data mining accuracy and classification precision. It enhances learning efficiency, increasing predictive performance and reducing the complexity of the learned results (31, 32). In this paper, records with more than 70% missing data were excluded from the analysis. For the remaining records, missing cells were imputed by mean substitution for continuous variables and mode substitution for discrete variables. The dataset used in this research is not balanced in terms of the number of records in each class. This imbalance disrupts the performance of ML algorithms. Hence, various techniques have been introduced to deal with unbalanced datasets.
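The record exclusion and mean/mode imputation described in this section can be illustrated with a minimal Python sketch. It is not the authors' KNIME workflow: the column names, the toy records, and the encoding of missing cells as `None` are assumptions made only for this example.

```python
from statistics import mean, mode

MISSING = None  # assumption: missing cells are encoded as None


def drop_sparse_records(records, max_missing_fraction=0.70):
    """Keep records whose fraction of missing cells does not exceed the threshold."""
    kept = []
    for row in records:
        n_missing = sum(1 for value in row.values() if value is MISSING)
        if n_missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


def impute(records, continuous, discrete):
    """Fill missing cells with the column mean (continuous) or mode (discrete)."""
    fill = {}
    for col in continuous:
        fill[col] = mean(r[col] for r in records if r[col] is not MISSING)
    for col in discrete:
        fill[col] = mode(r[col] for r in records if r[col] is not MISSING)
    return [
        {col: (fill[col] if value is MISSING else value) for col, value in r.items()}
        for r in records
    ]


# Hypothetical toy records; the values are invented for illustration.
raw = [
    {"age": 60, "spo2": 90, "fever": 1, "cough": 1},
    {"age": None, "spo2": 94, "fever": 1, "cough": 0},
    {"age": 40, "spo2": None, "fever": None, "cough": 1},     # 50% missing: kept
    {"age": None, "spo2": None, "fever": None, "cough": 0},   # 75% missing: dropped
]
kept = drop_sparse_records(raw)
clean = impute(kept, continuous=["age", "spo2"], discrete=["fever", "cough"])
```

Computing the fill values after the sparse records are dropped keeps near-empty records from influencing the column means and modes.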
To balance the data, we use the Synthetic Minority Over-Sampling Technique (SMOTE). SMOTE produces synthetic samples of each minority class based on its nearest neighbors in order to improve the classifier's generalization on minority classes (33, 34). To manage noisy data, the normal range of each variable was defined using the opinion of two infectious disease specialists, two virologists, and a hematologist. We then identified all values outside the defined range and corrected them by referring to the patient records or the responsible doctor. Finally, normalization is used to reduce differences in scale between variables (35). In this paper, the MinMaxNormalization technique is used to scale the dataset to the [0, 1] interval as follows (33):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of each column, respectively. This step is performed on the KNIME Analytics Platform.

Feature selection is one of the most important issues in ML and statistical pattern recognition (36, 37), and its methods are divided into three main categories: filter, wrapper, and hybrid methods (38-40). Feature selection in a high-dimensional dataset is one of the most important steps in ML, since it eliminates redundant and irrelevant features from the dataset. Various classical methods have been proposed for the feature selection problem, but as real-world problems grow in size and fast answers become more important, the quality of their solutions is no longer adequate. Intelligent optimization methods have provided effective help in solving these problems. Meta-heuristic optimization methods and evolutionary algorithms can therefore be regarded as among the most effective approaches to selecting features while accounting for their dependencies (41, 42).
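As an illustration of the two preprocessing operations named above, the following Python sketch scales a column with min-max normalization and generates SMOTE-style synthetic minority samples by interpolating between a minority record and one of its nearest minority neighbours. It is a simplified sketch, not the KNIME nodes used in the study; the function names, the value of `k`, the random seed, and the toy data are assumptions.

```python
import math
import random


def min_max_scale(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([xi + u * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic


# Hypothetical values, invented for illustration.
scaled_spo2 = min_max_scale([80, 90, 100])
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25]]
synthetic = smote(minority, n_new=6, k=2, seed=0)
```

Because each synthetic point lies on a segment between two existing minority records, the new samples stay inside the region already occupied by the minority class.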
Meta-heuristic algorithms are a type of wrapper-based feature selection model in which ML algorithms are used to select optimal features. In other words, the criterion for selecting a feature is the efficiency of the classification algorithm. In this paper, several well-known meta-heuristic algorithms, including the Horse herd Optimization Algorithm (HOA), are compared. KNN can compete with the most accurate classification models and is one of the most widely used machine learning algorithms, which makes it a strong option for classification in many real-world problems where accuracy is important. Also, based on feature selection studies, KNN is a well-suited algorithm for calculating the fitness value (45-47). The fitness value, which rewards feature subsets that reach the highest classification accuracy with the fewest selected features, is calculated by the following formula (Equation 1):

$$\text{Fitness} = \alpha\,\gamma_R(D) + \beta\,\frac{R}{C} \qquad (1)$$

where $\gamma_R(D)$ is the classification error rate of the KNN classifier, $R$ is the number of selected features, $C$ is the total number of features in the dataset, and $\alpha$ and $\beta$ are two parameters corresponding to the importance of classification quality and subset length. In this research, we set $\alpha = 0.99$ and $\beta = 0.01$. In this phase, all experiments were performed using MATLAB 2019. To evaluate the performance of the meta-heuristic algorithms in identifying the most effective factors, three performance metrics are calculated: mean fitness value, classification accuracy using KNN, and the number of selected features.

As shown in Figure 3, the last step of the proposed method is to classify the data with the ML algorithms K-NN, Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), and Decision Tree (DT). In this step, the ML algorithms are applied to the dataset before and after the feature selection step. This step is performed on the KNIME Analytics Platform.
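The wrapper fitness described above (a weighted sum of the KNN error rate and the fraction of selected features, with weights 0.99 and 0.01) can be sketched in Python as follows. This is an illustrative sketch, not the authors' MATLAB code: estimating the error rate by leave-one-out evaluation, the choice `k=3`, the binary-mask encoding of a feature subset, and the toy data are all assumptions.

```python
import math
from collections import Counter

ALPHA, BETA = 0.99, 0.01  # weights stated in the paper


def knn_error(X, y, mask, k=3):
    """Leave-one-out error rate of a k-NN classifier on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    errors = 0
    for i, xi in enumerate(X):
        dists = sorted(
            (math.dist([xi[j] for j in cols], [xo[j] for j in cols]), y[o])
            for o, xo in enumerate(X)
            if o != i
        )
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] != y[i]:
            errors += 1
    return errors / len(X)


def fitness(X, y, mask, alpha=ALPHA, beta=BETA, k=3):
    """Lower is better: alpha * error rate + beta * (selected / total features)."""
    n_selected = sum(mask)
    if n_selected == 0:
        return float("inf")  # an empty subset cannot be classified
    return alpha * knn_error(X, y, mask, k) + beta * n_selected / len(mask)


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.9], [0.1, 0.1], [0.2, 0.8], [0.8, 0.2], [0.9, 0.85], [1.0, 0.15]]
y = [0, 0, 0, 1, 1, 1]
informative_only = fitness(X, y, [1, 0])
noise_only = fitness(X, y, [0, 1])
```

A meta-heuristic such as HOA would evaluate this function for many candidate masks and keep the mask with the lowest value; with the weights above, classification error dominates and subset size mainly breaks ties.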
The performance of the classification algorithms was measured in terms of accuracy, precision, recall, specificity, and F-measure. The formula for each of these criteria is shown in Table 1. To evaluate the performance of each classifier, 10-fold cross-validation was used: the dataset was divided into 10 independent subsets, and each subset in turn served as test data while the remaining subsets served as training data. In addition, the Friedman statistical test was used to compare the algorithms more precisely and to select the most efficient one. This test assigns a rank to each algorithm, and the best algorithm has the lowest rank. The null hypothesis states that all algorithms perform the same, so rejecting it shows that the compared algorithms are significantly different. In this paper, we set the significance level to α = 0.05. Figure 3 shows the schematic workflow of this step, which is performed on the KNIME Analytics Platform.

A total of 6854 suspected COVID-19 cases were referred to Imam Khomeini hospital. After applying the exclusion criteria and removing records with more than 70% missing data, the records of 1225 RT-PCR-positive patients remained. The SMOTE method was used to balance the data. Before balancing, the "intubation" class contained only 176 records (13%); after balancing, the number of records in this class rose to 748. Then, using the MinMaxNormalization technique, all data were normalized between 0 and 1. All ML algorithms were run on the original and the preprocessed dataset, and the results are shown in Table 2. Since it is difficult to compare the performance of ML algorithms across five different evaluation metrics, the Friedman statistical test is used to compare and rank the ML algorithms on the basis of these criteria. The results of the Friedman test at a significance level of 0.05 are also shown in Table 2.
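To make the evaluation procedure concrete, the sketch below computes the five criteria from confusion-matrix counts using their standard definitions, ranks the algorithms per criterion (rank 1 is best), and derives the Friedman mean ranks and chi-square statistic. The confusion-matrix counts and algorithm labels are invented for illustration and are not results from this study; the p-value step is omitted, and the statistic is computed without a tie correction.

```python
def metrics(tp, fp, tn, fn):
    """Standard definitions of the five evaluation criteria."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * recall / (precision + recall),
    }


def mean_ranks(scores):
    """scores: {algorithm: {metric: value}}. Rank 1 = highest value per metric;
    tied algorithms share the average of the ranks they span."""
    algos = list(scores)
    metric_names = list(next(iter(scores.values())))
    totals = {a: 0.0 for a in algos}
    for m in metric_names:
        ordered = sorted(algos, key=lambda a: scores[a][m], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and scores[ordered[j + 1]][m] == scores[ordered[i]][m]:
                j += 1
            avg_rank = (i + j) / 2 + 1
            for a in ordered[i:j + 1]:
                totals[a] += avg_rank
            i = j + 1
    return {a: totals[a] / len(metric_names) for a in algos}


def friedman_statistic(ranks, n_blocks):
    """Friedman chi-square from mean ranks over n_blocks criteria (no tie correction)."""
    k = len(ranks)
    sum_sq = sum(r * r for r in ranks.values())
    return 12 * n_blocks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Hypothetical confusion matrices for three unnamed classifiers.
scores = {
    "model_a": metrics(tp=45, fp=5, tn=45, fn=5),
    "model_b": metrics(tp=40, fp=10, tn=40, fn=10),
    "model_c": metrics(tp=30, fp=20, tn=30, fn=20),
}
ranks = mean_ranks(scores)
statistic = friedman_statistic(ranks, n_blocks=5)
```

The statistic would then be compared against a chi-square distribution with one fewer degree of freedom than the number of algorithms, at the α = 0.05 level used in the paper.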
The achieved p-values indicate that there is a significant difference in the performance of the ML algorithms. According to Table 2, the DT, with a rank of 7.6, generally performed better than the other algorithms on the original dataset. The results in Table 2 also show that the performance of the ML algorithms in predicting the need for intubation improved significantly after preprocessing. According to Table 2, KNN, with a mean rank of 1.4, had the best performance on the preprocessed dataset. Additionally, ROC curves were plotted for all ML algorithms using the preprocessed dataset. As shown in Figure 4, KNN is the best model, since its area under the ROC curve is close to 1.

The results of the meta-heuristic feature selection algorithms are given in Table 3. The numerical results show that the HOA algorithm is superior to the other algorithms on all three criteria. We are looking for an algorithm that selects the fewest features while achieving higher classification accuracy. The HOA algorithm selects 12 features as the most effective risk factors for predicting the need for intubation: high age, high weight, dry cough, fever, dyspnea, loss of smell, cardiovascular diseases, hypertension, C-reactive protein, ALT/ASP, oxygen saturation (SPO2), and leukocytosis. The convergence diagram of the meta-heuristic algorithms is shown in Figure 5. Among meta-heuristic algorithms, the one with the lowest fitness value performs best. The results show that HOA, with a fitness of 0.083, is more efficient than the other algorithms in finding the fewest features with the highest classification accuracy.

Table 4. Performance evaluation results of classification algorithms.

According to Table 5, DT, with a rank of 1.2, is the best classification algorithm for predicting the need for intubation in patients with COVID-19. The SVM algorithm, with a mean rank of 4, is weaker than the other algorithms.
Comparing the best models in Tables 2 and 4 shows that, after feature selection, accuracy, precision, and specificity are reduced, whereas recall and F-measure are improved.

Given the wide range of clinical manifestations of COVID-19, it is important to develop models for estimating the likelihood of intubation using ML techniques (7). In response to this life-threatening infection, the design and implementation of Clinical Decision Support Systems (CDSS) will be critical to the optimal use of limited hospital resources and to support for clinical decisions. A CDSS equipped with ML can assist clinical decisions by informing caregivers and recommending interventions based on objective and generalizable empirical data (7, 48, 49). In this article, we analyzed data from a hospital registry database to develop and evaluate models capable of predicting the need for respiratory intubation in hospitalized COVID-19 patients from baseline clinical features. First, the efficiency of six feature selection methods was compared to identify the best predictors. The results show that the HOA algorithm is more efficient than the others in finding the fewest features with the highest classification accuracy. This study then adopted the most reliable and clinically relevant predictors of intubation identified by the HOA method, which yielded 12 variables highly correlated with the output class.

The results of the present study may help clinicians through correct, accurate, and timely diagnosis of disease progression and may reduce severe complications and the resulting mortality. Despite the small amount of data fed into the models and the lack of some important clinical variables, the selected ML models, especially the DT algorithm, performed well. Moreover, applying this model in real clinical environments will assist physicians owing to its simplicity and ease of use.
Despite the strength of the current study in the timely and accurate prediction of intubation risk, it has some limitations that need to be addressed. First, this is a retrospective study that suffers from limited data quantity (missing or duplicate cells) and non-optimal data quality (imbalanced, noisy, and meaningless values). Second, we deal with a single-center dataset of limited sample size, which undoubtedly restricts the generalizability of the proposed model. Moreover, we used only four well-known ML algorithms for the prediction analyses, based on some clinical features. Finally, the selected registry dataset lacks some important paraclinical variables. In the future, the accuracy and generalizability of our model could be enhanced by testing more ML techniques on larger, multicenter, prospective datasets with higher-quality, validated data.

The main idea behind this research is to evaluate several meta-heuristic feature selection algorithms and ML models for predicting the future risk of intubation among hospitalized patients with COVID-19. The present study may assist medical specialists in choosing the optimal supportive oxygen therapy for critically ill patients with respiratory failure by identifying and prioritizing predictors and by providing ML-based predictions. Our prediction model has the potential to provide frontline physicians with an easy and fast tool to classify COVID-19 patients without having to wait for the results of additional tests. This predictive model may also support better care delivery, lessen clinician workload, and reduce severe complications and death in COVID-19 patients. In future work, the proposed method is expected to be applied to other medical and healthcare domains, such as the early diagnosis and treatment of chronic disease. The meta-heuristic algorithms used in feature selection can also be improved.
The effect of COVID-19 derived cytokine storm on cancer cells progression: double-edged sword
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet
CDC COVID-19 Response Team. Preliminary estimates of the prevalence of selected underlying health conditions among patients with coronavirus disease 2019-United States
Lessons learned during COVID-19: Building critical care/ICU capacity for resource limited countries with complex emergencies in the World Health Organization Eastern Mediterranean Region
Prioritisation of ICU treatments for critically ill patients in a COVID-19 pandemic with scarce resources
Machine learning based clinical decision support system for early COVID-19 mortality prediction. Frontiers in Public Health
Developing a clinical decision support model to evaluate the quality of asthma control level
A consensus model to manage the non-cooperative behaviors of individuals in uncertain group decision making problems during the COVID-19 outbreak
Emergency Tracheal Intubation in Patients with COVID-19: A Single-center
DBNet: a novel deep learning framework for mechanical ventilation prediction using electronic health records
Determining affected factors on survival of kidney transplant in living donor patients using a random survival forest. Koomesh
Applications of artificial intelligence in COVID-19 pandemic: A comprehensive review
Application of artificial intelligence in COVID-19 diagnosis and therapeutics
Applications of artificial intelligence, machine learning, big data and the internet of things to the COVID-19 pandemic: A scientometric review using text mining
Identifying factors that affect patient survival after orthotopic liver transplant using machine-learning techniques
Mortality prediction model for the triage of COVID-19, pneumonia, and mechanically ventilated ICU patients: a retrospective study
Utilization of machine-learning models to accurately predict the risk for critical COVID-19. Internal and Emergency Medicine
Machine learning models for the prediction the necessity of resorting to icu of covid-19 patients
Prediction model and risk scores of ICU admission and mortality in COVID-19
Development of a multivariate prediction model of intensive care unit transfer or death: A French prospective cohort study of hospitalized COVID-19 patients
Prognostic Assessment of COVID-19 in the Intensive Care Unit by Machine Learning Methods: Model Development and Validation
Development of a machine learning algorithm to predict intubation among hospitalized patients with COVID-19
Machine learning based early warning system enables accurate mortality risk prediction for COVID-19
Machine-learning-based in-hospital mortality prediction for transcatheter mitral valve repair in the United States
Development and validation of a machine learning-based prediction model for near-term in-hospital mortality among patients with COVID-19
Federated Learning of Electronic Health Records to Improve Mortality Prediction in Hospitalized Patients With COVID-19: Machine Learning Approach
Clinical features of COVID-19 mortality: development and validation of a clinical prediction model. The Lancet Digital Health
An interpretable mortality prediction model for COVID-19 patients
Data preprocessing in data mining
Pregnancy-related anxiety during COVID-19: a nationwide survey of 2740 pregnant women. Archives of Women's Mental Health
Prediction of Atrial Fibrillation Recurrence after Thoracoscopic Surgical Ablation Using Machine Learning Techniques
Development of a prognostic model for mortality in COVID-19 infection using machine learning
Abnormal resting state effective connectivity within the default mode network in major depressive disorder: A spectral dynamic causal modeling study. Brain and Behavior
Effective connectivity of mental fatigue: Dynamic causal modeling of EEG data. Technology and Health Care: official journal of the European Society for Engineering and Medicine
The Beck Depression Inventory Second Edition (BDI-II): psychometric properties in Icelandic student and patient populations. Nordic Journal of Psychiatry
Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications
Brain structural and functional changes in patients with major depressive disorder: a literature review
Transcranial magnetic stimulation for the treatment of anxiety disorder. Neuropsychiatric Disease and Treatment
Voxel-wise meta-analysis of task-related brain activation abnormalities in major depressive disorder with suicide behavior
Neural Mechanisms in Eating Behaviors: A Pilot fMRI Study of Emotional Processing
A Decision Tree Model for Breast Reconstruction of Women with Breast Cancer: A Mixed Method Approach
Ghiasi MM, Zendehboudi S. Application of decision tree-based ensemble learning in the classification of breast cancer
A new COVID-19 Patients Detection Strategy (CPDS) based on hybrid feature selection and enhanced KNN classifier. Knowledge-Based Systems
A hybrid feature selection model based on butterfly optimization algorithm: COVID-19 as a case study
A hyper learning binary dragonfly algorithm for feature selection: A COVID-19 case study. Knowledge-Based Systems
Data Mining-Based Analysis of Chinese Medicinal Herb Formulae in Chronic Kidney Disease Treatment
Scalable architecture for telemonitoring chronic diseases in order to support the CDSSs in a common platform
Machine Learning Applied to Clinical Laboratory Data in Spain for COVID-19 Outcome Prediction: Model Development and Validation
Early Prediction of COVID-19 Ventilation Requirement and Mortality from Routinely Collected Baseline Chest Radiographs, Laboratory, and Clinical Data with Machine Learning
Prediction of respiratory decompensation in Covid-19 patients using machine learning: The READY trial. Computers in Biology and Medicine
A machine learning prediction model of respiratory failure within 48 hours of patient admission for COVID-19: model development and validation

The authors declare that they have no conflicts of interest.

Braam DH, Srinivasan S, Church L, Sheikh Z, Jephcott FL, Bukachi S. Lockdowns, lives and livelihoods: the impact of COVID-19 and public health responses to conflict affected populations - a remote qualitative study in Baidoa and Mogadishu, Somalia. Conflict and Health. 2021;15(1).