key: cord-0019875-uu2mypyr
authors: Shahid Ansari, Md.; Jain, Dinesh; Harikumar, Haripriya; Rana, Santu; Gupta, Sunil; Budhiraja, Sandeep; Venkatesh, Svetha
title: Identification of predictors and model for predicting prolonged length of stay in dengue patients
date: 2021-08-14
journal: Health Care Manag Sci
DOI: 10.1007/s10729-021-09571-3
sha: cdf7d398ede1459a73b2b230f499d6b096482c6d
doc_id: 19875
cord_uid: uu2mypyr

Purpose: Our objective is to identify the predictive factors and predict hospital length of stay (LOS) in dengue patients, for efficient utilization of hospital resources.

Methods: We collected 1360 medical records of patients with confirmed dengue infection from 2012 to 2017 at Max group of hospitals in India. We applied two data mining algorithms, logistic regression (LR) with elastic-net and random forest, to extract predictive factors and predict the LOS. We used the area under the curve (AUC), sensitivity, and specificity to evaluate the performance of the classifiers.

Results: The classifiers performed well, with LR with elastic-net providing an AUC of 0.75 and random forest providing an AUC of 0.72. Out of 1148 patients, 364 (32%) had prolonged LOS (>5 days), and the overall LOS was 4.03 ± 2.44 days (median ± IQR). The highest number of dengue cases belonged to the age group of 10-20 years (21.1%), with a male predominance. Moreover, the study showed that blood transfusion, emergency admission, assisted ventilation, low haemoglobin, high total leucocyte count (TLC), low or high haematocrit, and low lymphocytes have a significant correlation with prolonged LOS.

Conclusion: Our findings demonstrated that LR with elastic-net was the best fit, with an AUC of 0.75, and that there is a significant association between LOS greater than five days and the identified patient-specific variables.
This method can identify the patients at highest risk and help focus time and resources. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10729-021-09571-3.

Dengue is the fastest-growing mosquito-borne disease in the world today [1]. It is a viral infection that affects infants, young children, and adults, and it is transmitted by the bite of a mosquito infected with one of the four serotypes of the dengue virus. Aedes aegypti is the main vector in most urban areas of India and Asia. Aedes albopictus is also found as a vector in a few areas of southern and eastern India. The World Health Organization (WHO) estimates that nearly 400 million infections occur every year in over 128 countries in Asia, Oceania, America, and Africa. Reports indicate that about half of the world's population is currently at risk of dengue transmission [1].

Dengue in India has spread significantly over the past few decades, with rapidly changing epidemiology. According to data from the Directorate of National Vector Borne Disease Control Programme (NVBDCP) [2] and the National Health Profile 2018 [3], the 2017 spike in dengue cases was the highest of the preceding decade. From fewer than 60,000 cases in 2009, cases rose to 188,401 in 2017, more than three times the 2009 figure. Compared with 75,808 cases in 2013, the 2017 count is roughly 2.5 times as high (Fig 1). The number of outbreaks has risen, and certain states and union territories have become hyperendemic [4]. In 2015, the Indian capital region, Delhi, recorded its worst outbreak since 2006, with over 15,000 confirmed dengue cases [5].

Dengue diseases are characterized by a prolonged length of stay (LOS). Prolonged hospitalization is associated with adverse outcomes for the patients and the hospital, such as more complications, poor outcomes, and high care costs that create a significant economic burden for the hospital [6].
The overall cost of dengue was about US$5.71 billion in 2016 and US$1.51 billion in 2013 [7]. There has been considerable interest in controlling the use of hospital resources, particularly in dengue diseases; thus, hospitals try to make LOS as short as possible. The LOS can be used as an overall parameter to identify health care resource utilization, healthcare cost, and severity of illness [8]. Therefore, it is crucial to predict which patients need the most aggressive early intervention and which require a moderate amount of intervention to prevent prolonged LOS. Other studies [9-13] have predicted LOS for other diagnoses with data mining techniques, but very few studies [14-16] have addressed dengue LOS in hospitals. We apply data mining techniques to extract useful knowledge and to estimate the LOS for dengue patients.

In this paper, we present a system to predict the hospital LOS of patients with a confirmed dengue diagnosis. Our contributions are listed as follows:
- We collect and examine available information for confirmed dengue patients admitted at Max group of hospitals in the National Capital Region (NCR) of India.
- We propose strategies to handle missing values in the collected data.
- We predict the LOS of dengue patients with an encounter at one of the Max group healthcare systems in the NCR.
- We investigate the factors that can be assessed to predict the LOS of dengue patients.
- We predict which patients need the most aggressive early interventions and which require a moderate amount of intervention.

We use logistic regression (LR) with elastic-net and random forest classifiers for the prediction and for the identification of the important factors associated with the dengue patient data. We internally validate our results with evaluation measures such as recall, precision, and AUC. We also seek help from medical domain experts to validate the results.
The experiments show the usefulness of our method in predicting the LOS and in identifying predictive factors for classification.

Acharya et al. [17] conducted a prospective cross-sectional study of 364 patients with positive immunoglobulin M (IgM) dengue serology who were admitted to a tertiary care hospital with features of dengue fever. The authors found that age >40 years, presence of hypotension, platelets <20,000 cells/mm³, alanine aminotransferase (ALT) >200 U/L, aspartate aminotransferase (AST) >200 U/L, prolonged prothrombin time, presence of renal failure, encephalopathy, multiple organ dysfunction syndrome (MODS), acute respiratory distress syndrome (ARDS), and bleeding tendency (p-value <0.05) were significantly associated with an increased risk of mortality among dengue patients. In a separate study, Md-Sani et al. [18] built a logistic regression model on a data set of 199 adult patients hospitalized in Kuala Lumpur Hospital, Malaysia. The study identified lethargy, bleeding, pulse rate, serum bicarbonate, and serum lactate as statistically significant predictors of death. Jain et al. [19] conducted a study to identify the factors that influence dengue-related mortality and disease and found that age, sensorium, and dyspnea have a significant influence on mortality and severity. Wiratmadja et al. [16] conducted a study to predict hospital LOS of dengue patients using a demographic and illness- or health-related data set of 370 dengue fever (DF) and dengue haemorrhagic fever (DHF) patients in Bandung, Indonesia. The study identified systolic blood pressure, diastolic blood pressure, haematocrit, leucocytes, lymphocytes, monocytes, and comorbidity score as the most significant predictors. Chakravarty et al.
[14] conducted studies of patients admitted with dengue fever to a paediatric department in Northern India to determine the clinical and laboratory features, and found predictive factors for prolonged hospital admission. Mallhi et al. [15] carried out a cross-sectional retrospective study, based on a data set of 667 hospitalizations, to determine mortality and prolonged hospital stay among patients with a confirmed dengue diagnosis. The study showed that DHF, elevated alkaline phosphatase (ALP), prolonged prothrombin time (PT), activated partial thromboplastin time (aPTT), and multiple-organ dysfunction are associated with prolonged hospitalization.

Studies of LOS for other diagnoses include the following. Liu et al. [20] conducted a comparative analysis of two classification algorithms for predicting LOS, tested on geriatric and stroke data sets. Hachesu et al. [12] compared three classification algorithms to predict the LOS of heart patients and found SVM was the best fit. Combes et al. [11] explored the prediction of hospital stay in the emergency department using regression models. Blais et al. [10] derived a prediction model as a screening and rating tool, using multivariate analysis to quantify variables related to LOS for an acute care medical psychiatric unit. Azari et al. [9] designed an approach to predict hospital LOS by clustering data sets and using various classifier models such as Bayes net [21], SVM [22], JRIP [23], J48 [24], and Bagging [25]. Most prediction studies in dengue disease have attempted to classify DF and DHF or to predict in-hospital mortality [19]. However, very few studies have addressed the LOS prediction problem in dengue patients.

The cohort includes patients who were hospitalized under the department of Internal Medicine between February 2012 and September 2017.
We identified 1360 patients who were admitted to Max group of hospitals in India with a dengue-related diagnosis. The research study was approved by the institutional Max Healthcare Ethics Committee. We used standard WHO definitions to classify suspected dengue infection [26]. A total of 1148 microbiologically confirmed dengue patients are included in this study. Dengue confirmation was done using two methods.

Patient data are stored in the hospital's database management system, a Microsoft SQL Server database. We extracted data in three phases. In phase 1, demographic data and patients with a confirmed dengue diagnosis were extracted. Administrative and investigation information was extracted in phase 2. Radiological, procedure, and clinical data were then collected in phase 3. We constructed a new data set for hospital LOS of dengue patients from the extracted information. However, 212 patients were removed from the analysis because platelet count test information was unavailable, leaving 1148 patients in the final data set.

In the first screening of the features, 40 features were selected, including age, gender, type of admission, blood transfusion, assisted ventilation, and laboratory and radiological features of dengue patients, using data available from the first 24 hours of hospitalization. Units, value range, and missing percentage of each feature are given in Table 1. Demographic and clinical details were recorded at admission in a predesigned pro-forma, whereas laboratory findings were recorded daily until discharge or death. The dataset contains demographic, administrative, investigation, and radiological features with categorical and numerical values. The categorization of numerical and categorical features is given in Table 2. These features are converted to categorical variables to improve interpretability. The target feature, LOS, could take 28 different values in the initial data set.
Figure 2 illustrates the distribution of LOS in terms of count and percentage from February 2012 to September 2017. The most frequent LOS is 4 days (293, 25.5%) and the shortest LOS is 1 day (14, 1.2%). In this experiment, we binned the LOS values into two classes to build robust predictive models. Usually, patients with dengue infection have an average hospital stay between 3 and 5 days [14, 27-31]. We used >5 days as the cut-off point for prolonged hospitalization (LOS in the present study was 4.03 ± 2.44 days, median ± IQR). We divided LOS into two functional groups. First, we merged LOS of 1 to 5 days into one bin labelled '≤5 days' and coded as 0, representing patients for whom a moderate amount [9] of intervention is required to reduce LOS. Second, we pooled all stays longer than 5 days into a second bin labelled '>5 days' and coded as 1. The patients in the second group are the most in need of early aggressive intervention [9] to prevent long-term hospital admissions.

All patients (children and adults) with a serologically confirmed dengue infection were included. A case was excluded if dengue serology was negative in a patient with febrile illness. As a data cleaning step, we removed duplicate records and fields with more than 50% missing data.

Secondary use of Electronic Health Record (EHR) data can be challenging because the patient records within the EHR may be inconsistent and incomplete. The presence or absence of information, the timing, and other characteristics of the collected data may vary considerably from patient to patient. EHR data, especially laboratory measurements, often contain missing values for reasons such as time and cost constraints [32]. While hospital systems are capable of capturing the entirety of data measurements, some patient data are still found missing from databases [33].
The rates of missing data in the EHR have previously been reported to range from 20% to 80% [34, 35]. In this study, many laboratory measurements in the extracted data are missing at any given hour during the first 24 hours of a patient's hospital admission (Table 1). In this case, data may be missing not at random because measurements are taken on different schedules and at different frequencies. These shortcomings make it harder for algorithms to capture patterns in medical data sets. One approach to handling incomplete data is to discard all cases containing missing values; however, this can remove a significant portion of the training data and is generally not desirable. A more common approach is to apply data imputation. We followed three steps to handle the missing values:
1. Feature removal: If a feature had more than 50% of records with missing values, such as PT, APTT, and basophils, it was judged not to be an effective feature for the analysis and was removed (Table 1).
2. Median imputation: If a continuous feature had less than 50% missing values, the missing values were replaced by the median of the observed values (Table 1), because non-normality and some outliers were detected in these features.
3. Regression imputation: Regression imputation using R software was applied to features of nominal or ordinal type (Table 1). We imputed the missing values of these features using the regression imputation model, with the test AUC score shown in Fig. 3. The high AUC values provide good confidence in the imputed values.

The statistical impact of missing data was evaluated, and the details and statistical results of the imputed variables are available in the Supplemental Material (see Tables 1, 2).

Feature engineering is a key task in data preparation but a work-intensive component of machine learning applications [36].
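The first two missing-value steps listed above (dropping features with more than 50% missing values, then median imputation of a continuous feature) can be sketched in plain Python. This is only an illustration under stated assumptions, not the R workflow used in the study: the function names and the toy records are hypothetical, missing values are represented as `None`, and a feature with exactly 50% missing values is kept because the text does not say how that boundary was handled.

```python
from statistics import median

def drop_sparse_features(rows, max_missing=0.5):
    """Remove features whose share of missing (None) values exceeds max_missing."""
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(r[name] is None for r in rows) / n <= max_missing
    ]
    return [{name: r[name] for name in keep} for r in rows]

def impute_median(rows, feature):
    """Replace missing values of one continuous feature with its observed median."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    fill = median(observed)
    return [
        {**r, feature: fill if r[feature] is None else r[feature]}
        for r in rows
    ]

# Hypothetical toy records; None marks a missing measurement.
records = [
    {"haemoglobin": 13.1, "pt": None},
    {"haemoglobin": None, "pt": None},
    {"haemoglobin": 11.4, "pt": 14.0},
    {"haemoglobin": 15.0, "pt": None},
]

# "pt" is missing in 3 of 4 records, so it is dropped; the single missing
# haemoglobin value is filled with the median of the three observed values.
cleaned = impute_median(drop_sparse_features(records), "haemoglobin")
```

The median is used rather than the mean for the reason the text gives: it is less affected by the non-normality and outliers detected in these features.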
Initially, we collected 40 predictors using data available from the first 24 hours of hospitalization, including age, admission type, several predictors related to the investigation data, blood transfusion, assisted ventilation, and radiological predictors of dengue patients, and we then generated new features from these predictors. For feature generation from lab data, we used the standard lab reference ranges provided by the Max healthcare system and coded each value according to whether it is below, within, or above the reference range (Table 2); other features were converted into binary variables. After the feature generation process, we prepared a list of 389 independent variables and one dependent variable covering administrative, demographic, pathology, radiology, and procedure data. After pre-processing, the final dataset was randomly split into two subsets, a training set (70%) and a testing set (30%).

The selection of relevant attributes may also benefit from domain knowledge. Based on the studies in [15, 20, 37, 38], factors that often appear in dengue cases were selected as initial attributes, and these attributes were then validated by clinicians to ensure that no unrelated factors were used as predictors. We used several techniques to select variables: information value [39], variable importance from random forest [40], recursive feature elimination using logistic regression [39], chi-square [39], and L1 [39] feature selection. Each technique cast a vote for every variable it selected. As a final step, the votes were counted and the variables with the most votes were used in the modelling process. We removed sensitive patient information such as episode and location ID from the dataset.

We use an LR with elastic-net [41] model that utilizes EHR data to predict the LOS. It is a regularized regression technique that linearly combines the L1 and L2 penalties of the lasso [42] and ridge methods.
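How the two penalties are combined with the logistic loss can be sketched in a few lines of plain Python. This is a minimal illustration, not the glmnet implementation used in the study: the function names and toy inputs are hypothetical, and scaling the log-loss by 1/N and leaving the intercept unpenalised are assumptions made for the example.

```python
import math

def elastic_net_penalty(beta, lam1, lam2):
    """Return lam1 * ||beta||_1 + lam2 * ||beta||_2^2 for a coefficient list."""
    l1 = sum(abs(b) for b in beta)   # lasso part: encourages sparse weights
    l2 = sum(b * b for b in beta)    # ridge part: shrinks large weights
    return lam1 * l1 + lam2 * l2

def penalised_log_loss(beta0, beta, X, y, lam1, lam2):
    """Mean logistic negative log-likelihood plus the elastic-net penalty.

    Assumption for this sketch: the intercept beta0 is not penalised.
    """
    nll = 0.0
    for xi, yi in zip(X, y):
        z = beta0 + sum(b * v for b, v in zip(beta, xi))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        nll -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return nll / len(y) + elastic_net_penalty(beta, lam1, lam2)

# Hypothetical coefficients and penalty weights.
example_penalty = elastic_net_penalty([0.5, -2.0, 0.0], lam1=0.1, lam2=0.01)
```

Setting `lam2` to zero leaves a pure lasso penalty and setting `lam1` to zero leaves a pure ridge penalty, which is the sense in which elastic-net interpolates between the two.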
L1 regularization helps in sparsifying the weight vector, while L2 regularization limits the weight values to protect against outliers. Together, they allow elastic-net to find a stable and sparse weight vector for logistic regression [41]. The elastic-net estimator $\hat{\beta}$ is defined, in its standard form, as

$$\hat{\beta} = \arg\min_{\beta} \left[ -\frac{1}{N} \sum_{i=1}^{N} \Big( y_i \log P(X_i) + (1 - y_i) \log\big(1 - P(X_i)\big) \Big) + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \right] \tag{1}$$

where
- $N$ is the number of observations
- $y_i$ is the binary response at observation $i$
- $X_i$ is the data, a vector of $d$ values at observation $i$
- $\lambda_1$ and $\lambda_2$ are positive regularization parameters that weight the L1 and L2 norms of $\beta$
- $\beta$ is the vector of feature coefficients

The LOS probability for dengue hospitalization can be formulated as:

$$P(X_i) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta^{\top} X_i\right)\right)} \tag{2}$$

where $X_i$ are the independent variables and $P$ is the probability of prolonged LOS (>5 days) following dengue infection. We ran a grid search over 100 values for $\lambda_1$ and $\lambda_2$ and selected the setting with the lowest cross-validation error.

Random forest [40] is an ensemble of multiple decision trees. In a decision tree, each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label [43]. Building a random forest involves constructing individual decision trees from bootstrap samples of the data, using only a subset of predictors at each node of each tree. In our experiments, the random forest model was tuned via 10-fold cross-validation over 10 combinations of hyperparameter values (number of decision trees, number of features). We used the default values for the algorithm: 100 trees, the Gini index for splitting and for computing variable importance, a minimum of five observations to form a terminal node, and the square root of the number of variables as the number of candidates for splitting each node. The algorithms were executed using R, an open-source software application for statistical computing and data mining [23].
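Three ingredients of the random forest procedure described above can be sketched in plain Python: the Gini index used to score splits, the bootstrap sample drawn for each tree, and the random subset of about √p predictors considered at each node. This is for illustration only and is not the R code used in the study; the function names are hypothetical, rounding √p down is an assumption, and 389 is simply the number of generated variables reported earlier, used here as an example value of p.

```python
import math
import random

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices to try at one split."""
    k = max(1, math.isqrt(n_features))
    return rng.sample(range(n_features), k)

rng = random.Random(0)                           # fixed seed for reproducibility
rows_for_tree = bootstrap_indices(10, rng)       # rows used to grow one tree
split_candidates = candidate_features(389, rng)  # predictors tried at one node
```

A node containing equal numbers of the two LOS classes has the maximum two-class impurity of 0.5, and a pure node has impurity 0, so a split is preferred when it lowers the weighted impurity of the child nodes.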
The glmnet and caret libraries were used for the LR with elastic-net and random forest algorithms, respectively. We also evaluated other methods, such as support vector machine (SVM) and extraTrees, to predict LOS, but these were excluded because of unsatisfactory predictive performance. We used 70% of the data for training and the remaining 30% for testing. For fine-tuning of the model parameters, we used 10-fold cross-validation on the training dataset: the training data was divided into ten folds, nine folds were used to train the model, and the remaining fold was used to assess model performance and generalizability.

The Kruskal-Wallis H test is used to test for statistically significant differences in a continuous dependent variable between two or more groups of an independent variable [44]. To check the association between two categorical variables, the Chi-square test [45] is used when the expected frequencies are higher than 5, whereas Fisher's exact test [46] is performed when the expected table values are smaller than 5. Statistically significant differences are determined by p-value <0.05 (Table 3). The results are summarized as mean ± standard deviation and median ± IQR for continuous features.

Performance measures: We assess a set of performance measures including sensitivity or recall, precision or positive predictive value (PPV), and AUC for each model. We use traditional performance measures for classification that are based on the four values of the confusion table: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). We use these values to compute the positive predictive value (PPV) or precision, negative predictive value (NPV), sensitivity or recall, and specificity as in Eq. 3, 4, 5, and 6:

$$\text{PPV} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{NPV} = \frac{TN}{TN + FN} \tag{4}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{6}$$

In addition, the Receiver Operating Characteristic (ROC) curve is graphed and the area under the ROC curve (AUC) [43] is analyzed.
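The performance measures above can be computed directly from predicted labels and scores. The sketch below is a plain-Python illustration with hypothetical function names and toy data, not the evaluation code used in the study. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half), and the 0.5 threshold is an arbitrary choice for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """PPV, NPV, sensitivity and specificity; assumes no denominator is zero."""
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc_from_scores(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = LOS > 5 days) and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative 0.5 threshold

metrics = classification_metrics(*confusion_counts(y_true, y_pred))
auc = auc_from_scores(y_true, scores)
```

Unlike the four threshold-dependent measures, the AUC does not depend on the cut-off used to turn scores into labels, which is why it is reported alongside them.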
Of the 1148 confirmed dengue cases in this study, 974 (84.8%) belonged to the adult age group (>12 years) and 174 (15.2%) to the paediatric age group (≤12 years). A larger proportion of positive cases was observed among adults. The largest share of dengue cases was in the age group of 10-20 years (21.1%), where there was a male predominance. The age distribution of cases is shown in Fig. 4. The distribution of males and females across the different age groups did not differ significantly (p >0.05). The mean age of the 1,148 patients was 58.2 ± 13.0 years (SD) (aged >0 to ≤87), and the majority (60.3%) were male (Table 3). The LOS was 4.03 ± 2.44 days (median ± IQR). Prolonged hospitalization (>5 days) was seen in 32% (364/1148) of patients, while LOS was ≤5 days in 68% (784/1148) of patients.

The characteristics of patients with and without prolonged LOS were compared and are shown in Table 3. Considering >5 days as "prolonged LOS", serum creatinine, platelet count, total leucocyte count, alanine aminotransferase (ALT), aspartate aminotransferase (AST), haematocrit (%), haemoglobin (g/dL), assisted ventilation, blood transfusion, and emergency admission type were identified as highly statistically significant independent predictors (Table 3).

Platelet count was done in all the confirmed cases, of which 72 patients (6.3%) had a platelet count ≤20,000 (severe thrombocytopenia), 249 patients (21.7%) ranged from 20,000 to 50,000 (moderate thrombocytopenia), and 245 patients (21.3%) ranged from 50,000 to 100,000 (mild thrombocytopenia), while the remaining 582 patients (50.7%) were above 100,000 (Fig. 5). Of these cases, 49.3% had thrombocytopenia (<100,000) while the remaining 50.7% had normal platelet counts (Table 3). A significant association was observed between thrombocytopenia and age group. Thrombocytopenia was more severe in the 30-40 years age group than in the older age group, and this difference was significant (p <0.05).
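The platelet-count grading and the reported percentages can be reproduced with a short plain-Python sketch. The function name is hypothetical, and two details are assumptions because the text leaves them open: counts are taken to be in cells/mm³, and a value that falls exactly on a boundary (20,000, 50,000, or 100,000) is assigned to the lower band. The four counts are the ones reported above.

```python
def thrombocytopenia_grade(platelet_count):
    """Grade a platelet count (assumed cells/mm^3) using the bands in the text."""
    if platelet_count <= 20_000:
        return "severe"
    if platelet_count <= 50_000:
        return "moderate"
    if platelet_count <= 100_000:   # boundary handling is an assumption
        return "mild"
    return "normal"

# Patient counts per band as reported in the text.
counts = {"severe": 72, "moderate": 249, "mild": 245, "normal": 582}
total = sum(counts.values())

# Percentage of the cohort in each band, rounded to one decimal place.
shares = {grade: round(100 * n / total, 1) for grade, n in counts.items()}

# Any grade other than "normal" counts as thrombocytopenia.
thrombocytopenia_share = round(100 * (total - counts["normal"]) / total, 1)
```

The four bands sum to the 1148 patients in the final data set, and the three thrombocytopenia bands together account for the 49.3% quoted in the text.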
The relative importance of a variable reflects how much that feature contributes to a prediction; it does not relate to model accuracy [43]. Based on the model performance, we extracted the top risky and protective features for a longer LOS. We call features associated with a longer LOS (positive correlation) risky features, whereas safe features are those that show an inverse relationship with longer LOS. Top features based on LR with elastic-net are reported in Fig. 6. The most significant factors for a longer stay are lymphocytes, total leucocyte count, alanine aminotransferase (ALT), aspartate aminotransferase (AST), red blood cell count (RBC), red cell distribution width (RDW), haematocrit, neutrophils, and platelet count. Emergency admission type, blood transfusion, marital status being single, and right effusion were also significant in predicting prolonged LOS. On the other hand, eosinophils, AST, and high lymphocytes are the safe features that contribute to a shorter LOS.

The random forest model, with the parameter settings described earlier, was also used to extract important factors; in Fig. 6, features with a large impact on LOS are listed in order of variable importance. The random forest identified some top features, such as lymphocytes, eosinophils, haemoglobin, and marital status, that agree with elastic-net. Across both methods, the most significant factors were blood transfusion, emergency admission type, assisted ventilation, and thrombocytopenia. Low haemoglobin and high TLC are also strong predictors of prolonged LOS in dengue patients. Haematocrit (low and high) played a notable role as well: men with haematocrit <40% or >50% and women with haematocrit <36% or >46% had a statistically higher mean LOS.
A low haematocrit (critical value <15-20%) may cause cardiac failure or death [47-49], and a high haematocrit (critical value >60%) may cause spontaneous blood clotting [47-49]. A low lymphocyte value signifies that the patient is more likely to have prolonged LOS. Furthermore, a previous admission also increases the risk of prolonged LOS. Thus, the most notable features influencing LOS for dengue patients identified by the algorithms are blood transfusion, emergency admission type, assisted ventilation, low haemoglobin, high TLC, low and high haematocrit, low lymphocytes, and previous admission.

Table 4 shows the performance measures of the LR with elastic-net and random forest classifiers. The LR with elastic-net model achieved an AUC of 0.75, whereas the random forest model achieved an AUC of 0.72. The AUC of 0.75 indicates that the LR with elastic-net model has good ability to predict prolonged hospitalisation among patients with dengue (Fig. 7). Confusion matrices for both models are available in Fig. 3 of the Supplemental Material.

Fig. 7 Receiver-operating characteristics curve analysis of both the models on test data predicting prolonged hospitalisation among dengue patients

This research investigated the determinants of hospital LOS in patients representative of confirmed dengue diagnosis within our group of healthcare centers. Previous studies have predicted in-hospital mortality [19] of dengue patients. Our findings indicate a significant association between LOS greater than 5 days and lymphocytes, leucocytes, alanine aminotransferase (ALT), aspartate aminotransferase (AST), red blood cell count (RBC), red cell distribution width (RDW), haematocrit, neutrophils, and platelet count. Emergency admission type, blood transfusion, marital status being single, and right effusion were also significant in predicting prolonged LOS.
In general, we found that the LR with elastic-net model, trained on data from Max group healthcare systems, predicted prolonged LOS better than random forest. We plan to use this predictive model as a screening tool to proactively identify high-risk patients who should receive a care coordination intervention to reduce prolonged LOS of dengue patients. This is in direct contrast to current department care, which is reactive and requires a patient's symptoms to be present before clinicians and care coordinators provide care. We believe that transforming care coordination from a reactive to a proactive activity carries great potential to reduce LOS. As enrolment in the hospital is still in progress, it remains to be seen whether this model can be translated into the real world. If the model is successful, there are potential implications for LOS and cost reduction. Long stays and care coordination activities are often expensive, and aligning care coordination resources with the patients responsible for large costs to the health system may optimize resource allocation.

We found that both models have good overall performance. The agreement on important variables between the two models provides more confidence in the prediction. Implementing either of these models can enable efficient management of hospital resources and planning of preventive interventions for patients with intense conditions. As a result, this study provides better insight into the underlying factors that influence the LOS.

The operating point on the ROC curve was selected for operational reasons. Our aim is to identify dengue patients at risk of a longer length of stay, so as to direct limited hospital resources to the high-risk group. Therefore, we have chosen to accept lower sensitivity in exchange for higher specificity.
More specifically, we aim to reduce Type I error by minimizing false positives, so that a smaller group of patients is identified with higher specificity for early intervention and optimized resource allocation.

Healthcare data are generally not fully structured and are distributed across various locations. We are aware that our current study has several limitations, which could be addressed in subsequent work. First, although our model is likely generalizable to the Max group healthcare systems from which the data were collected, the information obtained is not clinically exhaustive, as the present work relied entirely on the demographic, administrative, investigation, and radiological data retrieved from hospital electronic databases. The model may not generalize to other parts of the Indian healthcare system with different demographics and practice patterns. It may need to be developed for each community, using the process we describe. Second, we did not collect data from outside our healthcare environment, so we may miss earlier predictors of prolonged LOS in our model.

The study indicates that routinely collected hospital data can be used to identify prolonged LOS in dengue patients and may also provide insight into the factors influencing hospital LOS that can easily be interpreted by the clinician. Our results show that both LR with elastic-net and random forest can predict a dengue patient's LOS, but LR with elastic-net is the best fit, with an AUC of 0.75. We intend to implement the derived model in our information systems to give clinicians real-time feedback and reduce long LOS of confirmed dengue patients during admission. This could help clinicians plan preventive interventions, thereby improving health services and managing hospital resources more efficiently.
Also, we intend to conduct a follow-up study to measure and potentially improve the predictive performance of the model, after system implementation is rolled out.

Dengue/DHF situation in India
World Health Organization. Dengue and severe dengue
10ème conférence francophone de modélisation, optimisation et simulation-MOSIM'14
Open Forum Infectious Diseases
Pacific Asia Conference on Information Systems
Machine Learning Proceedings
C4.5: Programs for Machine Learning, by J. Ross Quinlan
Tropical Diseases, W.H.O.D. of Control of Neglected Tropical Diseases, W.H.O. Epidemic, P. Alert, Dengue: guidelines for diagnosis
AMIA Annual Symposium Proceedings
Prospective EHR-based clinical trials: the challenge of missing data
Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems
A manual of laboratory and diagnostic tests