key: cord-0879945-2g7u8mvh authors: Alabbad, Dina A.; Almuhaideb, Abdullah M.; Alsunaidi, Shikah J.; Alqudaihi, Kawther S.; Alamoudi, Fatimah A.; Alhobaishi, Maha K.; Alaqeel, Naimah A.; Alshahrani, Mohammed S. title: Machine learning model for predicting the length of stay in the intensive care unit for Covid-19 patients in the eastern province of Saudi Arabia date: 2022-04-14 journal: Inform Med Unlocked DOI: 10.1016/j.imu.2022.100937 sha: eec265bb7ebdff2665a9e09920ff8988197c54f7 doc_id: 879945 cord_uid: 2g7u8mvh
The rapid worldwide spread of the COVID-19 virus turned it into a global pandemic. Managing resources is one of the biggest challenges that healthcare providers around the world face during the pandemic. Allocating Intensive Care Unit (ICU) bed capacity is important because COVID-19 is a respiratory disease and some patients must be admitted to the hospital with an urgent need for oxygen support, ventilation, and/or intensive medical care. In the battle against COVID-19, many governments utilized technology, especially Artificial Intelligence (AI), to contain the pandemic and limit its hazardous effects. In this paper, Machine Learning (ML) models were developed to help detect COVID-19 patients' need for the ICU and estimate the duration of their stay. Four ML algorithms were utilized: Random Forest (RF), Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), and Ensemble. The models were trained and validated on a dataset of 895 COVID-19 patients admitted to King Fahad University Hospital in the Eastern Province of Saudi Arabia. The experiments show that the Length of Stay (LoS) in the ICU is predicted with the highest accuracy by the RF model, which achieved 94.16%. In terms of the factors contributing to the length of stay in the ICU, correlation results showed that age, C-Reactive Protein (CRP), and nasal oxygen support days are the most strongly related factors.
A search of the literature found no published work that used a Saudi Arabian dataset to predict the need for the ICU together with the number of days needed. It is hoped that this contribution will pave the way for hospitals and healthcare providers to manage their resources more efficiently and help save lives.
respiratory crackles [17].
Aktar et al. illustrate the strong relationship between abnormal blood parameters and the ALoS in decision-making. They also recommended that censored patients be included in the analysis along with discharged cases to reduce bias.
actual admission to the ICU. They found that the length of stay in the ICU for elderly people with heart disease is high. Moreover, LoS in the ICU is affected by abnormal values of many factors, namely lymphocyte absolute value, erythrocyte count, total cholesterol, adenovirus IgM antibody, hypersensitive C-reactive protein, high-sensitivity troponin I, and Q fever Rickettsia IgM antibody [23].
of Wuhan. They found that fever is not always an initial symptom of COVID-19, as only 70% of the study sample reported it. They also found that age, weight, gender, and career do not affect the length of hospital stay. They designed a multivariate model to predict the risk of a long hospital stay. Long hospital stays increase medical costs and the level of risk. Early estimation helps in making many decisions and allocating resources [24].
the LoS in the ICU, and mortality of COVID-19 patients.
University Hospital in Dammam city in Saudi Arabia. The first step was to review different machine learning algorithms and select the most suitable ones for developing the prediction models. Next, data pre-processing was applied to recover the missing data and solve the imbalanced data problem
They stated that the best prediction accuracy was achieved by using RF.
The XGBoost classifier can accurately predict breast cancer, and it achieved the highest accuracy in [28]. In [29], GB and LR were utilized to predict recurrent bleeding, the need for therapeutic intervention, and severe bleeding. That work demonstrated that the GB algorithm is a robust classifier that can handle large input sizes and fits simple models to achieve higher accuracy.
The proposed models in this work were implemented in the Python programming language, which provides several tools for machine learning tasks. Below is a brief description of the classifiers used to build the proposed models.
The random forest, as the name implies, is made up of a large number of individual decision trees that work together as an ensemble. It enhances accuracy by relying on a group of decision models rather than a single learning model. The key distinction between this approach and traditional decision tree algorithms is that the root node and the node splits are chosen at random [30]. The trees protect each other from their individual errors, which explains why they have such a strong effect. While some trees may produce an incorrect classification, many others will be correct, so the trees as a group move in the proper direction. As a result, the predictions, and thus the errors, generated by individual trees must have low correlation with each other for the random forest to perform well [31]. Furthermore, RF offers many advantages, such as the ability to be used for both classification and regression tasks and to handle missing values.
Gradient boosting is a boosting technique, an ensemble mechanism that combines numerous simple models into a single composite model. The overall model becomes a stronger predictor as additional simple models are introduced [33].
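The idea of adding simple models to build a stronger composite predictor can be illustrated with a minimal, hand-rolled sketch. It is not the implementation used in this work: it assumes a regression setting with squared loss, one-split "stumps" as the simple models, and toy data; the function names, the learning rate, and the number of stages are all hypothetical choices made for illustration.

```python
# Minimal gradient boosting sketch (assumption: squared loss, 1-D inputs).
# Each simple model is a one-split stump fitted to the current residuals.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every distinct value except the largest
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Fit stumps one after another; return the base value, stumps, and MSE per stage."""
    base = sum(ys) / len(ys)          # start from the mean of the targets
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)   # new simple model targets what is still wrong
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

def predict(base, stumps, x, learning_rate=0.5):
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data, invented for illustration only.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, mse_history = boost(xs, ys)
print("training MSE after first and last stage:", mse_history[0], mse_history[-1])
```

With squared loss and a learning rate between 0 and 2, the training error cannot increase from one stage to the next, which is one way to see why the composite model strengthens as stumps are added.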
Each model attempts to compensate for the flaws of its
Each iteration combines the weak rules of each classifier to generate a single, strong prediction rule [34]. Gradient boosting can be utilized for both regression and classification tasks [33].
voting system [38]. It is only appropriate to employ this approach if the output of multiple classifiers
XGBoost:
- Designed to be used with large, complex datasets and to avoid model overfitting.
- The method is scalable in all cases.
- It can handle sparse data as well as parallel and distributed computation, which makes learning faster [43].
- Always involves many classification and regression trees [44].
- Complex and computationally expensive [40].
Ensemble:
- It is combined by weighted averaging or by voting over a collection of single classifiers.
- The ensemble method combines multiple weak classifiers into a strong classifier. An empirical study shows that the cost of building a base classifier is lower than the cost of building a strong classifier.
- It can maximize the information of the base learner and improve the overall classification ability [45].
- The robustness of the method is affected by the quality of the dataset [45].
This section describes the dataset used in the research and the process of data preparation and pre-processing.
The dataset was obtained from King Fahad University Hospital in Dammam City in the Eastern Province of Saudi Arabia. Since the start of the pandemic, this hospital was one of the main centers to receive and treat
The data preparation and preprocessing involved two tasks: first, filling in the missing data; second, solving the issue of imbalanced data. Since the dataset contained some missing values, the KNN imputation method was used to fill in these
k=3, k=5, k=7, k=10, k=15.
Like many other datasets, this dataset is imbalanced.
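The KNN imputation step mentioned above can be sketched as follows. This is a hand-rolled illustration, not the authors' code and not a library API: the function `knn_impute`, the distance rescaling, and the toy rows (hypothetical age, CRP, and nasal oxygen days values) are assumptions made for the example. Only the list of k values comes from the text.

```python
import math

def knn_impute(rows, k):
    """Fill None entries with the column mean over the k nearest donor rows.

    Assumption: distance is Euclidean over the columns observed in both rows,
    rescaled by the share of columns used so sparse comparisons are not favored.
    Donors are always taken from the original rows, never from filled values.
    """
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a donor must have the missing column observed
                shared = [c for c in range(n_cols)
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                sq = sum((row[c] - other[c]) ** 2 for c in shared)
                candidates.append((math.sqrt(sq * n_cols / len(shared)), other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [value for _, value in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical records: [age, CRP, nasal oxygen days]; None marks a missing value.
rows = [
    [25, 10.0, 0],
    [30, 12.0, 1],
    [62, 80.0, 6],
    [65, None, 7],
    [70, 95.0, 8],
    [None, 11.0, 0],
    [60, 70.0, None],
    [33, 15.0, 1],
]

# The k values listed in the text; with this tiny table, any k of 7 or more
# simply uses every available donor row.
imputed = {k: knn_impute(rows, k) for k in (3, 5, 7, 10, 15)}
for k, table in imputed.items():
    print(k, table[3][1])
```

In practice the candidate k values would be compared by how well the downstream models perform on the imputed data; that comparison is outside this sketch.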
Figure 1 shows the frequency of every class in the attribute "Days to discharge from the ICU". As shown, the majority of the records belong to class 0, while class 6 and class 7 had only 30 and 12 records, respectively. This may bias the model toward class 0. An imbalanced dataset is a problem because the model will not be able
Also, the dataset attributes were evaluated using entropy to understand how the impurity or heterogeneity of the target class is computed.
Table , which also shows that 3-fold provides the highest performance. Results were compared in order to identify the highest-performing model.
The second goal of this work was to identify the features that are most relevant to patients' need for the ICU and the length of their stay. Therefore, the feature correlation with the number of days to discharge from the ICU was extracted from the heatmap.
Four prediction models were developed using four ML classifiers: Random Forest (RF), Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), and Ensemble. The dataset was split into two parts, 80% for training and 20% for testing. The same data was used with all models to compare their performance. A number of experiments were conducted with different k-fold settings. Feature selection was applied using the Boruta algorithm to reduce the number of features. However, using the full set of features provided higher accuracy.
The conducted experiments show that the highest accuracy was achieved by the RF model, at 94.16%, which is high compared to other published works with the same purpose.
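The evaluation setup summarized above (an 80%/20% hold-out split followed by k-fold experiments on the training portion) can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' pipeline: the helper names, the random seed, and the fold counts 3, 5, and 10 are hypothetical, and only the 895-record dataset size and the 80/20 ratio come from the text.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle record indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def kfold_indices(indices, n_folds):
    """Yield (train, validation) index lists; every index is validated exactly once."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 895 is the number of patient records reported in the paper.
train_idx, test_idx = holdout_split(895)

# Assumed fold counts; the text only says several k-fold settings were tried.
for n_folds in (3, 5, 10):
    sizes = [len(validation) for _, validation in kfold_indices(train_idx, n_folds)]
    print(n_folds, sizes)
```

Each classifier would be fitted on the training indices of every fold and scored on the matching validation indices, with the held-out 20% reserved for the final comparison between models.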
Feature correlation with the length of stay in ICU was obtained, and the results
The shortage of hospital beds for COVID-19 and non-COVID-19 patients during the lockdown of Wuhan, China
A Closer Look Into Global Hospital Beds
Denis Campbell Health policy editor, UK coronavirus victims have lain undetected at home for two
COVID-19 ICU and mechanical ventilation
COVID-19 Disease Severity Based on Clinical Blood Test Data: Statistical Analysis and Model
Clinical Characteristics and Prognostic Factors for
A Predictive Model for Patient Census and Ventilator Requirements at
COVID-19) Pandemic: A Preliminary Technical Report
Estimating lengths-of-stay of hospitalized
Epidemiol
Clinical characteristics of Coronavirus Disease
Early Prediction of Mortality, Severity, and Length of Stay in the Intensive Care Unit of Sepsis Patients Based on Sepsis 3.0 by Machine Learning Models
Prediction of COVID-19 confirmed, death, and cured cases in India using random forest model
A Risk Prediction Model for Type 2 Diabetes Based on Weighted Feature Selection of Random Forest and XGBoost Ensemble Classifier
Developing A Web based System for Breast Cancer
Why Random Forest is My Favorite Machine Learning Model
What is Gradient Boosting and How is it different from AdaBoost?
Learn about boosting algorithms and how they
What, IBM Cloud Learn
Introduction to Gradient Boosting algorithm
Extreme Gradient Boosting (XGBoost) Ensemble in Python
Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov.
Data
Ensemble Classification
Classifier Ensembles
Comparing different supervised machine learning algorithms for disease prediction
Accurate prediction of functional effects for variants by combining gradient tree boosting with optimal neighborhood properties
Improving the Techniques
Performance Comparison of Machine Learning Classifiers for Fake News
Machine Learning Model for Risk
Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier
Nearest neighbor imputation algorithms: a critical evaluation
Addressing the Class Imbalance Problem in Medical Datasets