key: cord-0941380-rtmwo34n
authors: Saadatmand, Sara; Salimifard, Khodakaram; Mohammadi, Reza; Marzban, Maryam; Naghibzadeh-Tahami, Ahmad
title: Predicting the necessity of oxygen therapy in the early stage of COVID-19 using machine learning
date: 2022-02-11
journal: Med Biol Eng Comput
DOI: 10.1007/s11517-022-02519-x
sha: 5d6ad19f8d11971024d1b7a4a23e6cd7378cd15b
doc_id: 941380
cord_uid: rtmwo34n

Medical oxygen is a critical element in the treatment of COVID-19 patients, and its shortage adversely affects the treatment process. This study aims to apply machine learning (ML) to predict the requirement for oxygen-based treatment for hospitalized COVID-19 patients. In the first phase, demographic information, symptoms, and patient's background were extracted from the databases of two local hospitals in Iran, and preprocessing actions were applied. In the second step, the related features were selected. Lastly, five ML models including logistic regression (LR), random forest (RF), XGBoost, C5.0, and neural networks (NNs) were implemented and compared based on their accuracy and capability. Among the variables related to the patient's background, opium consumption was included in the models because of the high rate of opium use in Iran. Of the 398 patients included in the study, 112 (28.14%) received oxygen-based treatment. Shortness of breath (71.42%), fever (62.5%), and cough (59.82%) had the highest frequency in patients with oxygen requirements. The most important variables for prediction were shortness of breath, cough, age, and fever. For opioid-addicted patients, in addition to the high mortality rate (23.07%), the rate of oxygen-based treatment was twice as high as in non-addicted patients. XGBoost and LR obtained the highest area under the curve, with values of 88.7% and 88.3%, respectively. LR and NNs achieved the same, and highest, accuracy (86.42%).
This approach provides a tool that accurately predicts the need for oxygen in the treatment process of COVID-19 patients and helps hospital resource management.

Since the spread of COVID-19 in December 2019, the healthcare sector has played a key role in combating the disease. On March 11, 2020, the World Health Organization (WHO) declared COVID-19 a pandemic [1]. Hospitals, as one of the main players, have allocated a large part of their resources to dealing with this disease. However, the growing number of patients caused by successive variants of the disease led to shortages of hospital resources such as ICU beds, medicine, and oxygen. Oxygen is a critical element in the treatment of COVID-19 patients; according to the WHO, about 15% of cases require medical oxygen [2]. Reducing the respiratory failures caused by COVID-19 depends on the availability of oxygen and ventilation [3]. India was one of the countries where a lack of medical oxygen affected hospital services for COVID-19 patients [4]. To avoid supply shortages in hospitals during this pandemic, it is necessary to have accurate and timely predictions of required equipment such as oxygen and ventilators.

Artificial intelligence (AI), which has been used widely in medicine, can detect and learn non-linear relationships among variables and can be used to diagnose, treat, and predict outcomes [5]. In healthcare, AI helps to uncover unknown patterns in data and to make effective decisions accordingly [6]. It has been applied to the COVID-19 pandemic for screening, analyzing, tracking patients, and making medical predictions [7]. Furthermore, machine learning (ML), a subset of AI, is used for computational epidemiology, early detection, diagnosis, and disease progression of COVID-19, as well as for clinical management issues of this illness such as ICU admission, mechanical ventilation, multi-organ failure, and death [8, 9].
To make valid predictions in medicine, a supervised ML model requires a dataset containing a number of features and a relevant outcome [10]. For COVID-19, these features can be demographic information, symptoms, lab results, and the background of the patient. Among the various variables related to a COVID-19 patient's background, such as diabetes, cancer, smoking, and kidney and liver diseases [11], the effect of opium consumption on COVID-19 patients needs more research [12]. Opium is one of the most common and popular drugs among Iranian people and has been used for more than five centuries. Iran has one of the highest rates of opium use in the world, at about 2.7% of its population [13]. In 2013, a national-scale study was designed to evaluate the spread of substance abuse and opium addiction in Iran and to zone the country into low- to high-risk areas. The initial results of the study revealed that Kerman ranks highest and is the province with the most opium users [14]. A survey on drug abuse in Kerman [15] indicated that the prevalence of substance abuse in the rural areas of Kerman was 22.5% and the rate of addiction was 6%.

As mentioned earlier, since the spread of the novel coronavirus, problems related to the capacity of the health system to serve the increasing number of patients have grown [16]. One of the basic requirements during a pandemic is to accurately predict the required resources and likely outcomes. Among the numerous resources required in the treatment of COVID-19 patients, medical oxygen is an essential one [17]. Several studies have applied ML models to predict the requirement for medical oxygen. Lee et al. [18] developed a prediction model that identifies COVID-19 patients at risk of requiring medical oxygen. They used the information of 221 patients, with C-reactive protein, hypertension, age, and neutrophil and lymphocyte count as parameters.
Their model achieved a high AUC. To predict the need for mechanical ventilation, the authors of [19] used a cohort of 1980 COVID-19 patients. Their data included demographics, patient's background, vital signs at the emergency room, and laboratory data. Their results demonstrated that age and fever were associated with the risk of ventilator requirement. In another article [20], a machine learning approach was used to predict mechanical ventilation for COVID-19 patients. The input data included 12 clinical features of 197 COVID-19 patients collected from US hospitals. Their model predicted the mechanical ventilation requirement from blood factors and other variables such as blood pressure and heart rate.

In this research, a machine learning approach is proposed to predict the requirement for oxygen-based treatment based on patient characteristics and clinical data. One of the main contributions of this research compared with previous studies is that it predicts the outcome (oxygen requirement) at the time of the patient's admission to the hospital, using only the symptoms and the patient's background, without requiring lab results or further information. This can accelerate resource planning, especially at the peak of the disease, and help avoid shortages. The second novelty of this article, which is significant from a medical viewpoint, is assessing the impact of opium use on the requirement for oxygen-based treatment and on the fatality rate of COVID-19 cases, using data collected from Kerman, which has a high prevalence of opium users in Iran. The results can assist hospitals in forecasting the need for oxygen and managing this resource effectively.

In this section, the characteristics of the dataset and the preprocessing operations applied to the raw data are described. Figure 1 illustrates an overview of the steps taken to build the prediction model.
In the first phase, the required data of hospitalized COVID-19 patients were collected from hospitals. Next, the raw data were cleaned and preprocessing operations were applied. In the third phase, the relevant features were selected. In the fourth step, the prepared data were split into training and test sets. Then, the training set was used as input to train several machine learning models. After model training, the test set was used to predict the outcomes. The prediction models were compared based on their accuracy and capability.

Data for this study were collected from two local hospitals in Kerman province in the south of Iran. The data were acquired from the hospital database and written records of 398 hospitalized patients with positive COVID-19 tests (PCR) over a period of 6 months, from February to July 2020. The admitted patients' information contained demographic data, patient's background, and symptoms of the disease. The average age of patients was 41.11 years, with a median and mode of 39 and 33 years, respectively. Hospitalized male cases were more frequent than female cases and comprised 54% of the total records. The numbers of discharged cases and deaths were 377 (94.72%) and 21 (5.27%), respectively. The severity of disease, based on the patient's condition and symptoms, was divided into three categories: mild, moderate, and severe. Of the patients, 6% experienced severe disease, while 94% experienced mild to moderate severity. Patients with a mild condition received only medication, the moderate group received medication and oxygen by mask, and the severe cases, besides medication, used ventilators. Figure 2 shows the flowchart of selecting cases for this study. From a total of 400 cases, two records with missing values were excluded. Of the 398 hospitalized COVID-19 patients, 28.14% received oxygen-based treatment, of whom 13.39% were opioid-addicted.
Non-oxygen-required treatments were applied to 286 patients.

The original dataset included 57 features. In the data cleaning phase, non-required data such as patient ID were removed; the job variable was also omitted because of its variation and the difficulty of job classification. After the study's objectives were specified, a consultation with medical specialists was conducted to determine the most relevant characteristics and features. The final features included demographic characteristics (two variables), patient's background (nine variables), disease symptoms (eight variables), and a target variable (type of treatment). Demographic information comprises gender and age. History of other diseases such as diabetes, blood pressure, and lung disease belongs to the patient's background class, and the last group includes initial symptoms of COVID-19 such as cough, fever, and shortness of breath. The type of treatment was divided into two classes: oxygen-required treatment and non-oxygen treatment. The first group included patients who used oxygen masks or ventilators besides medication, whereas the second group received only medication. Opium, its extracts, and heroin were the four addiction-related variables; for each of them, the starting age of consumption, the amount of daily consumption, the number of daily usages, and the type of use (orally taken or smoked) were collected. To add the opioid addiction variable to the database, these drugs, which are common among addicted people in Kerman, were combined into one binary variable and added to the dataset. Since the data were collected exclusively for scientific purposes, and in order to be more accurate, the missing values in the electronic dataset were filled from the available data in the patients' written records, and only two incomplete medical records were omitted. Except for the age variable, all variables are binary, and no outliers were observed.
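The step of merging the four addiction-related variables into one binary feature can be sketched as follows. This is a minimal illustration in Python, not the authors' code (the study itself was implemented in R). The column names `opium`, `opium_extract_a`, `opium_extract_b`, and `heroin`, and the two example records, are hypothetical, since the paper does not list the actual field names.

```python
# Hypothetical column names: the paper names the drugs but not the dataset fields.
ADDICTION_COLUMNS = ("opium", "opium_extract_a", "opium_extract_b", "heroin")


def add_opioid_flag(record):
    """Return a copy of `record` in which the four addiction-related
    variables are replaced by one binary `opioid_addiction` variable."""
    uses_any = any(record.get(column, 0) == 1 for column in ADDICTION_COLUMNS)
    combined = {
        key: value for key, value in record.items()
        if key not in ADDICTION_COLUMNS
    }
    combined["opioid_addiction"] = 1 if uses_any else 0
    return combined


# Two made-up patient records, for illustration only.
patients = [
    {"age": 52, "cough": 1, "opium": 1, "opium_extract_a": 0,
     "opium_extract_b": 0, "heroin": 0},
    {"age": 33, "cough": 0, "opium": 0, "opium_extract_a": 0,
     "opium_extract_b": 0, "heroin": 0},
]

flagged = [add_opioid_flag(patient) for patient in patients]
for row in flagged:
    print(row)
```

A patient is flagged as opioid-addicted when any one of the four source variables is set, which matches the description of the drugs being "combined into one binary variable"; whether the authors used exactly this rule is an assumption.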
Five machine learning algorithms, namely logistic regression, neural networks, the C5.0 decision tree, random forest, and XGBoost, were applied to predict the requirement for oxygen-based treatment in COVID-19 patients.

Logistic regression (LR) is a well-suited model for binary outcomes in fields such as medical science, especially for exploring the relationship between risk factors and the incidence of disease [21, 22]. In this paper, a multivariable LR with 19 predictors was used to predict oxygen and non-oxygen treatment for hospitalized patients with positive COVID-19. To fit the LR to the dataset, the iteratively reweighted least squares method was applied [23].

A neural network (NN) is modeled on the nervous system: its nodes, analogous to neurons, learn from the input data to optimize the final output [24]. The NNs in this study used one input layer with the 19 factor variables, one hidden layer, and an output layer. The entropy fitting method was used to fit the NNs to the dataset. The maximum number of iterations and the maximum number of weights were set to 100 and 1000, respectively.

C5.0 is the improved version of the C4.5 decision tree algorithm developed by Quinlan [25]. It descends from the ID3 algorithm and reduces the misclassification errors caused by noise in the training data set [26]. For this algorithm in our prediction model, the number of boosting iterations was set to ten and the trees were decomposed into a rule-based model.

Random forest (RF) [27] is a machine learning method that is normally used for classification and regression. Its suitability for a wide range of prediction problems and the simplicity of its parameter tuning are the two main reasons for the popularity of the RF algorithm [28]. For the proposed RF prediction model, the number of trees and the minimum size of terminal nodes were set to 200 and 1, respectively.
The number of variables randomly sampled as candidates at each split was set to ten.

XGBoost implements the gradient boosting concept and provides parallel tree boosting to solve a wide range of regression and classification problems quickly and accurately. The algorithm applies a more regularized formalization to control over-fitting [29]. In this study, the maximum depth, number of rounds, and subsample ratio of columns for XGBoost were set to 1, 150, and 0.8, respectively. All parameters related to the five ML algorithms are displayed in Table 1.

The 19 predictor features were categorized into three classes (as shown in Table 2). Patients in the age category of 19-60 years made up 74.37% of all cases and

In this study, information about the use of opium and its consequences for hospitalized patients with positive COVID-19 tests was collected. Because of the prevalence of opium consumption in this province, opium-addicted cases made up 6.53% of the total sample. As illustrated in Fig. 3, the frequency of opioid addiction is higher in the male group than in the female group. From the total number of hospitalized patients, 26 individuals were addicted which included 4

A point to consider is the high rate of mortality among opioid-addicted cases. The fatality ratio among non-addicted patients is 4.03%, while for opioid-addicted cases it is 23.07%. The survival and death of both groups (addicted and non-addicted) are illustrated in Fig. 4. To determine the relationship between opioid addiction and oxygen-required treatment, the Chi-square test was conducted. The Chi-square test of independence is used to determine whether there is a relationship between two categorical variables. In this case, the null hypothesis assumes that there is no relationship between addiction to opioids and the requirement for oxygen, and the alternative hypothesis assumes that there is an association between these two nominal variables.
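The test of independence described above can be sketched for a 2x2 table using only the Python standard library. The cell counts below are not reported as a table in the paper; they are reconstructed here from the stated totals (26 addicted patients, 112 oxygen-treated patients of whom 13.39%, i.e. 15, were addicted, and 286 patients without oxygen) and should be read as an approximation. The use of the uncorrected Pearson statistic is also an assumption, since the paper does not say whether a continuity correction was applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence (no continuity correction)
    for the 2x2 contingency table [[a, b], [c, d]].

    Returns the test statistic and its p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability of a chi-square
    # variable x equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: opioid-addicted / non-addicted. Columns: oxygen / no oxygen.
# Counts reconstructed from the reported totals (an assumption, see above).
statistic, p_value = chi_square_2x2(15, 11, 97, 275)
print(f"chi-square = {statistic:.3f}, p = {p_value:.5f}")
```

Under these assumptions the p-value falls below 0.001, in line with the result reported in the text.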
After computing the test statistic, it was found that p < 0.001. Hence, the null hypothesis was rejected at the 95% confidence level. It can be concluded that the relationship between these two variables is statistically significant.

The dataset was randomly split into training and test sets in a ratio of 80:20 while keeping the class distribution balanced. Ten-fold cross-validation was applied to the training set to validate and evaluate the reliability of the developed models. ML algorithms including LR, NNs, C5.0, XGBoost, and RF were applied to build five different prediction models. To compare and evaluate the performance of the models, accuracy, receiver operating characteristic (ROC) curves, Cohen's kappa, balanced accuracy, the confusion matrix, and the area under the curve (AUC) were calculated. The ROC curve displays the performance of a classifier as the discrimination cut-off value changes over the range of the predictor variable. Points higher above the diagonal line indicate a better predictive value of the test [30].

The accuracy and kappa of the applied ML models on the training set are displayed in Fig. 5. For the accuracy metric, LR and XGBoost obtained the maximum (90.90%), and RF and NNs, with a value of 90.62%, were in second place. The mean accuracy of every model was just above 80%, except for NNs (78.28%). The kappa measurement was also calculated for all models. Kappa is a measure of inter-rater agreement [31]. In machine learning, it measures the level of agreement between the true values and the predicted values. The LR algorithm achieved the highest kappa (0.7924), and XGBoost, with 0.7785, was in second place. RF and NNs obtained the minimum kappa, with values of 0.0350 and 0.1428, respectively.

Figure 6 illustrates the ROC curves of the five ML models on the test set. Comparing the performance of the algorithms, XGBoost has the highest AUC, followed by LR.
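As a brief illustration of how an AUC value summarizes a ROC curve, the sketch below computes the AUC as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case (ties count as one half). This rank-based form is equivalent to the area under the empirical ROC curve. The labels and scores are made-up toy values, not data from this study, and the function name `auc_from_scores` is introduced here only for illustration.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive case has the
    higher score; tied pairs contribute one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = oxygen required, 0 = not required (hypothetical values).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to the diagonal line of the ROC plot, and values closer to 1 indicate better separation of the two classes.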
NNs, C5.0, and RF were in third, fourth, and fifth positions, respectively. The confidence intervals of all five models were satisfactory, ranging from 74.1 to 96.5%.

The proposed approach was implemented in R using libraries such as caret [32], ggplot2 [33], and liver [34]. The caret environment provides various machine learning models such as NNs, RF, and LR. The overall runtime of the proposed model was 3.35 min, which makes it an appropriate tool for decision-making at peak times. XGBoost had the longest and LR the shortest runtime among all algorithms. Compared with other schemes, such as making decisions based on historical data and previous experience, this approach helps decision-makers decide more accurately in a shorter time.

Sensitivity and specificity are two statistical performance metrics for classification models and diagnostic tests. Sensitivity is the ability of the model to predict the true positives, while specificity evaluates the model's prediction of the true negatives [35]. To calculate them, the models' confusion matrices (Fig. 7) were used. Table 3 shows the performance of the models in terms of accuracy, kappa, sensitivity, specificity, and balanced accuracy. LR and NNs

In a classification model, each variable has a specific impact on the predictions. Variable importance is a technique that indicates the relative importance of each input variable in a model's prediction. The more important a variable, the more the model depends on it to make an accurate prediction [36]. Variable importance can be used to determine the most and least important variables for the model and to improve the model's performance by dropping ineffective features. In NNs, XGBoost, and RF, age had the highest score (100%), while in LR and C5.0, shortness of breath and cough, with 100% relative importance, were the most effective variables in prediction.
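The metrics defined above (sensitivity, specificity, accuracy, balanced accuracy, and Cohen's kappa) can all be derived from a single confusion matrix, as the following sketch shows. The four counts used here are not taken from Fig. 7, which is not reproduced in this text. They are one set of counts, inferred by us, that is consistent with the sensitivity (0.9273), specificity (0.7308), and accuracy (86.42%) reported for LR and NNs on a test set of 81 patients; which class was treated as "positive" is likewise an assumption.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive common performance metrics from the four cells of a
    binary confusion matrix."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / n              # observed agreement
    balanced_accuracy = (sensitivity + specificity) / 2
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
    }


# Inferred counts (an assumption, see the paragraph above): 55 actual
# positives and 26 actual negatives in a test set of 81 patients.
metrics = confusion_metrics(tp=51, fn=4, fp=7, tn=19)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa is lower than raw accuracy here because part of the observed agreement would be expected by chance given the class imbalance, which is why both metrics are reported side by side.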
In LR and C5.0, the variable importance of age was 59.97% and 91.48%, respectively. Shortness of breath and cough are among the five most important features in four of the classification models (RF, LR, C5.0, NNs). Four variables (age, shortness of breath, cough, and fever) appear among the top five features in NNs, LR, and RF. For shortness of breath, the relative importance in LR, C5.0, NNs, RF, and XGBoost was 100%, 100%, 82.06%, 42.46%, and 31.88%, respectively. For cough, the relative importance in C5.0, LR, NNs, RF, and XGBoost was 100%, 75.75%, 70.95%, 70.95%, and 14.65%, respectively. The opioid addiction variable behaved differently in each model, with scores of 5.694% in NNs, 32.578% in LR, 1.903% in RF, 92.11% in C5.0, and 0.8475% in XGBoost.

Oxygen therapy is one of the main treatment choices for COVID-19 patients and reduces the fatality rate among critical cases [37, 38]. In our proposed approach, four features (shortness of breath, cough, fever, and age) were identified as the most important variables in predicting the requirement for oxygen-based treatment in the early stages of admission. The association of shortness of breath and cough with receiving oxygen-based treatment has been addressed in various studies, whose findings are consistent with our results. Long et al. [38] analyzed the clinical information of 1362 COVID-19 patients of a local hospital in Wuhan. They found that most of the patients who experienced breathlessness, such as shortness of breath, dyspnea, and chest tightness, received oxygen therapy. In another study, in Ethiopia [39], a longer duration of supplemental oxygen requirement was associated with shortness of breath. The authors also found that, compared with patients without this symptom on admission, the rate of ending oxygen therapy was 29.5% lower in cases with shortness of breath.
Ni et al. [40] concluded that dyspnea is among the factors related to oxygen therapy for COVID-19 patients under 65 years and can increase their need for oxygen. They found that 59.5% of patients with dry cough received oxygen therapy.

Fever is one of the most common symptoms in patients with COVID-19 [41]. According to [42], 43.8% of COVID-19 patients experienced fever on admission and 88.7% during hospitalization. In our ML models, fever was an important feature in the prediction of oxygen requirement and can be considered an early sign on admission. Ni et al. [40] also found a relationship between fever and oxygen therapy: among COVID-19 patients, 70.9% of those with fever received oxygen therapy.

It has been shown that as the patient's age increases, the severity of COVID-19 increases [43], as does the risk of in-hospital death [44]. In our algorithms, age was an effective factor in the prediction of oxygen-based treatment. In addition, [39] recognized age as an important factor in the starting time of oxygen therapy, which is related to a longer duration of oxygen requirement among COVID-19 patients.

The effect of opium consumption on the requirement for oxygen-based treatment was analyzed in this study. The data were gathered from Kerman province in the south-east of Iran, which has a high number of opium consumers [45]. This gave the research the opportunity to evaluate the impact of opioid addiction in the prediction model. About 58% of opioid-addicted patients received oxygen-based treatment, including ventilator and oxygen mask. This rate was much lower for non-addicted individuals (26.34%). Additionally, the fatality rate among opium-addicted cases was high (23.07%) compared with non-addicted patients (4.03%). This supports the previous claim [46] that the death rate is higher among COVID-19 patients who use opium, probably because of the negative impact of opium on the immune system and respiratory cells.

There were several limitations to this study.
First, the sample size of COVID-19 patients was small, especially for oxygen therapy and mechanical ventilation. Second, the data were collected from two hospitals in one province, which may affect the reliability of the model because symptoms and other disease factors vary between populations. Third, the prediction model does not specify the exact time at which each patient needed oxygen-based treatment, because the data were limited. Fourth, the available features were limited to the patient's background and symptoms, with no information on lab results.

Future research can consider more features such as lab results, vital signs of the patient, and CT images. In addition, a larger dataset needs to be used to build more reliable models. Apart from oxygen, the need for other COVID-19-related supplies such as medications and beds can be predicted. Further studies may also collect the data at specific time intervals in order to determine when supplies and equipment are needed.

In this study, information on hospitalized COVID-19 patients from two local hospitals in Iran was used to predict the requirement for oxygen-based treatment. First, relevant attributes were selected based on experts' opinions, and then five ML classifiers were used to predict the oxygen requirement. The proposed approach found that the most important variables in predicting the need for oxygen therapy were age, shortness of breath, cough, and fever. One of the main objectives of this research was to predict oxygen-based treatment in the early stages of patient admission, and according to the results, the model showed high accuracy and sensitivity in predicting the outcome. Among the five ML algorithms, NNs and LR achieved high sensitivity (0.9273) and specificity (0.7308), which demonstrates their capability in predicting the need for oxygen-based treatment for COVID-19 patients.
XGBoost showed the highest AUC (0.887). Another aim was to analyze the effect of opium consumption on the requirement for oxygen in COVID-19 cases. The results revealed a high rate of oxygen requirement and a high fatality ratio in this group of patients compared with other cases. In conclusion, the availability of medical resources, especially during a pandemic and at the peak of infections, is an essential issue in managing hospital resources. Artificial intelligence tools such as ML can help to accurately predict the need for medical supplies such as oxygen and avoid shortages.

WHO Director-General's opening remarks at the media briefing on COVID-19 -11
Oxygen sources and distribution for COVID-19 treatment centres
Oxygen provision to fight COVID-19 in sub-Saharan Africa
Covid-19: Countries rally to support India through 'storm that has shaken the nation'
Artificial Intelligence in Medical Applications
Application of Artificial Intelligence in COVID-19 Pandemic: Bibliometric Analysis
Artificial Intelligence (AI) applications for COVID-19 pandemic
Role of Machine Learning Techniques to Tackle the COVID-19 Crisis: Systematic Review
Artificial Intelligence Applications for COVID-19 in Intensive Care and Emergency Settings: A Systematic Review
Machine learning in medicine: a practical introduction
People with Certain Medical Conditions, Centers for Disease Control and Prevention
Opium Addiction and COVID-19: Truth or False Beliefs
Prevalence of drug use, alcohol consumption, cigarette smoking and measure of socioeconomic-related inequalities of drug use among Iranian people: findings from a national survey
Awareness and Attitude Towards Opioid and Stimulant Use and Lifetime Prevalence of the Drugs: A Study in 5 Large Cities of Iran
The Household Survey of Drug Abuse in Kerman, Iran
The Role of Artificial Intelligence in Fighting the COVID-19 Pandemic
Predication of oxygen requirement in COVID-19 patients using dynamic change of inflammatory markers: CRP, hypertension, age, neutrophil and lymphocyte (CHANeL)
Machine learning methods to predict mechanical ventilation and mortality in patients with COVID-19
Prediction of respiratory decompensation in Covid-19 patients using machine learning: The READY trial
Logistic Regression: A Brief Primer
Logistic regression
Iteratively Reweighted Least Squares: Algorithms, Convergence Analysis, and Numerical Comparisons
An Introduction to Convolutional Neural Networks
C4.5: Programs for Machine Learning by
Comparative analysis of decision tree algorithms
Random decision forests
A random forest guided tour
XGBoost
The receiver operating characteristic (ROC) curve
A Coefficient of Agreement for Nominal Scales
Building Predictive Models in R Using the caret Package
Liver package: eating the liver of data science (version 1.10), CRAN
Sensitivity and Specificity
An Introduction to Statistical Learning
A rapid advice guideline for the diagnosis and treatment of 2019 novel coronavirus (2019-nCoV) infected pneumonia (standard version)
Effect of early oxygen therapy and antiviral treatment on disease progression in patients with COVID-19: A retrospective study of medical charts in China
Duration of Supplemental Oxygen Requirement and Predictors in Severe COVID-19 Patients in Ethiopia: A Survival Analysis
The independent factors associated with oxygen therapy in COVID-19 patients under 65 years old
World Health Organization
Clinical Characteristics of Coronavirus Disease 2019 in China
Age-dependent effects in the transmission and control of COVID-19 epidemics
Early predictors of in-hospital mortality in patients with COVID-19 in a large American cohort
Is opium use associated with an increased risk of lung cancer? A case-control study
The Reasons for Higher Mortality Rate in Opium Addicted Patients with COVID-19: A Narrative Review

The authors would like to appreciate the anony-

Sara Saadatmand is a researcher in operations research and member of CIIORG at Persian Gulf University. Her research interests include machine learning and intelligent optimization.

Khodakaram Salimifard is Associate Professor of Operations Research and leader of CIIORG at Persian Gulf University. His research interests include computational intelligence & intelligent algorithms.

Professor of Statistics at the Business School, University of Amsterdam. His research interests include business analytics and machine learning.