key: cord-0997038-agpjosmk
authors: Coombes, Caitlin E.; Coombes, Kevin R.; Fareed, Naleef
title: A novel model to label delirium in an intensive care unit from clinician actions
date: 2021-03-09
journal: BMC Med Inform Decis Mak
DOI: 10.1186/s12911-021-01461-6
sha: 2c22f3113df4a5775f5374d874d9d5dd05582f3d
doc_id: 997038
cord_uid: agpjosmk

BACKGROUND: In the intensive care unit (ICU), delirium is a common, acute, confusional state associated with high risk for short- and long-term morbidity and mortality. Machine learning (ML) has promise to address research priorities and improve delirium outcomes. However, due to clinical and billing conventions, delirium is often inconsistently or incompletely labeled in electronic health record (EHR) datasets. Here, we identify clinical actions, abstracted from clinical guidelines, in EHR data that indicate risk of delirium among ICU patients. We develop a novel prediction model to label patients with delirium based on a large data set and assess model performance.
METHODS: EHR data on 48,451 admissions from 2001 to 2012, available through the Medical Information Mart for Intensive Care-III database (MIMIC-III), were used to identify features to develop our prediction models. Five binary ML classification models (Logistic Regression; Classification and Regression Trees; Random Forests; Naïve Bayes; and Support Vector Machines) were fit and ranked by Area Under the Curve (AUC) scores. We compared our best model with two models previously proposed in the literature for goodness of fit, precision, and through biological validation.
RESULTS: Our best performing model with threshold reclassification for predicting delirium was based on a multiple logistic regression using the 31 clinical actions (AUC 0.83). Our model outperformed other proposed models by biological validation on clinically meaningful, delirium-associated outcomes.
CONCLUSIONS: Hurdles in identifying accurate labels in large-scale datasets limit clinical applications of ML in delirium. We developed a novel labeling model for delirium in the ICU using a large, public data set. By using guideline-directed clinical actions independent from risk factors, treatments, and outcomes as model predictors, our classifier could be used as a delirium label for future clinically targeted models.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01461-6.

presentation of the syndrome is broad, including an agitated, hyperactive subtype; a somnolent, hypoactive subtype; or mixed features [5]. The hypoactive subtype is less frequently diagnosed and has poorer prognosis [5]. Additional patients may manifest with subsyndromal delirium or "attenuated delirium syndrome": a subclinical confusional state meeting part, but not all, of the DSM-5 criteria for delirium [12]. Due in part to delirium's comorbid presentation with serious illness, advanced age, depression, and dementia [5, 12] and its heterogeneous and fluctuating symptom presentation [12], delirium is often under-recognized in the hospital [5, 12, 13]. Because delirium arises comorbidly, the primary treatment is identification, diagnosis, and treatment of the etiologic organic illness or toxic insult, accompanied by pharmacological and nonpharmacological delirium symptom management [11]. These challenges make delirium an important target of machine learning (ML) [14-22]. Training ML models requires a valid delirium label that accurately captures patients with the condition. For a method of labeling to be useful as a foundation for clinical prediction, it must be independent of both risk factors and outcomes of interest.
Although the gold standard is a provider-administered screening tool such as the Confusion Assessment Method for the ICU (CAM-ICU) [13, 23], these labor-intensive identifiers must be prospectively administered and are not available in all settings [13, 20-22], revealing a need for a delirium identifier that can be abstracted retrospectively and computationally from the medical record. Two preliminary studies on small cohorts (< 400 patients) have proposed other simple, chart-based labels when CAM-ICU is absent. Kim et al. [24] used the CAM-ICU and provider interview as the gold standard to label delirium with modest sensitivity (30%), high specificity (97%) and high positive predictive value (PPV = 83%) from the presence of either an International Classification of Diseases (ICD) code or antipsychotic use, with improved sensitivity for delirium that was hyperactive or mixed type (64%) or severe (73%). By chart review, Puelle et al. [25] identified eight key words or phrases (altered mental status, delirium, disoriented, hallucination, confusion, reorient, disorient and encephalopathy) with high PPV (60-100%) for delirium (model sensitivity and specificity not reported).

Here we present an assessment of three methods to label delirium in the chart from medical record events. We propose a supervised binary classifier based on counts of 31 clinician actions, including medications, orders, and clinical impressions in free-text notes. All 31 predictors are independent of risk factors and outcomes of interest, generating a labeling method that could be used as a foundation for downstream clinical predictions. We compare this model to Kim et al.'s classification based on ICD code and antipsychotic use ("Kim's classifier") and to Puelle et al.'s eight words with high PPV ("Puelle's classifier"). To the best of our knowledge, we are the first to test these proposals on a large-scale dataset.
Because our dataset is too large to permit chart review and CAM-ICU is unavailable, we set ICD code as our initial delirium identifier. We assess the quality of classification of each model by biological validation [26] on clinically meaningful, delirium-associated outcomes, demonstrating superior performance with our model of 31 clinician actions. Our model has the potential to be generalized and implemented across ICU datasets to support improved labeling for downstream clinical predictive modeling.

In 2015, Inouye et al. proposed research priorities for delirium, including improved diagnosis and subtyping, stratification of high-risk patients, biomarker detection, and identification of genetic determinants [3]. Researchers have since applied unsupervised ML, including clustering [15] and latent class analysis [14], to subtype patients. More commonly, supervised ML is used to predict delirium incidence within an ICU stay based on a priori risk factors [21], heart rate variability [17], or medical record events from the first 24 h of hospitalization [16, 18, 20, 27]. To make clinically actionable predictions, the researcher requires a delirium label that is independent of the clinical covariates and predictors of interest. The preferred measures in clinical practice for labeling delirium are nurse- or provider-administered, validated screening tools, including the CAM-ICU [13, 23] and the Intensive Care Delirium Screening Checklist (ICDSC) [13, 28, 29]. CAM-ICU administered during treatment is a mainstay label of delirium in the ML research setting [14-19]. However, variations in institutional practice and physician buy-in can lead to inconsistent use of the CAM or ICDSC in the clinical setting [13]. When CAM-ICU is unavailable or suspect, researchers may employ nurse chart review [20, 21]. However, chart review relies on clinical judgment [25] and poses time and labor costs that grow prohibitive as data sets increase in size.
Other researchers have used ICD codes as a delirium label [22]. Though convenient, ICD codes, especially secondary codes (such as delirium in a critical illness setting), are prone to high levels of missingness and inaccuracy [30-32]. Although the prevalence of delirium in the ICU has been estimated to be as high as 24-82% [2-4], published models have been built using ICD code labels for delirium that may be as sparse as 3.1% [22]. This mismatch between the expected proportion of patients with delirium and the available ICD codes suggests a risk of outcome misclassification if ICD codes are used, with potential for serious bias in learned model outputs [33]. Weaknesses in the delirium labeling underlying much state-of-the-art research call the generalizability and clinical utility of these studies into question.

Various tools are available when binary outcome misclassification in a dataset is suspected. Sensitivity analysis can be used to adjust the summary output of a logistic regression model, but it relies heavily on frequency estimates supplied by the researcher's a priori knowledge of the field, and cannot be learned from the model [33]. For some binary classifiers, outcome misclassification can be addressed by tuning model cut-points based on a priori knowledge, researcher goals for sensitivity or specificity, or properties of the receiver operating characteristic (ROC) curve to enact a desired reclassification, a core practice in diagnostic test development [34] with applications in supervised model refinement [16]. Assessing outcome reclassification on real data is challenging due to absence of a gold standard. However, the concern is pressing: unless model fit is perfect (sensitivity and specificity = 100%), all binary classification inherently generates some degree of "outcome reclassification," where members labeled as belonging to one group when entering the model are later predicted to belong to the other group.
For clinical regression models, Harrell et al. proposed that the concordance index or c-index, calculated from pairwise comparisons of a prognostic indicator between classified and reclassified subjects, could be employed as a "clinically meaningful" measure of model goodness-of-fit [37]. We have previously proposed the related principle of biological validation: that ML assignments can be meaningfully validated by employing well-understood biological outcomes when ground-truth is unavailable [26]. Inspired by Harrell's approach, we compare five prognostic measures between classified and reclassified groups to biologically validate outcome reclassification and model goodness-of-fit for delirium identification.

Study data were drawn from Medical Information Mart for Intensive Care-III (MIMIC-III), a freely available database of electronic health record (EHR) data collected on 63,157 intensive care unit (ICU) admissions at Beth Israel Deaconess Medical Center from 2001 to 2012 [38-41]. Delirium within a hospitalization was defined by ICD-9 code [24]. (Additional file 2: Restricting LOS removed 2,315 outlier hospitalizations (4.6%) with LOS up to 295 days. From the cohort population, 25% of positives and negatives were randomly sampled and reserved for a test set (12,135 admissions), retaining 75% for training (36,406 admissions).

We proposed a model to label presence of delirium in a chart based on clinician actions. We hypothesized that changes in clinical actions concordant with diagnostic work-up for delirium can serve as an indicator that the clinical team had made a delirium diagnosis. Clinician actions presumed to indicate a response to delirium onset were identified from published guidelines for delirium work-up and abstracted from EHR data. These included 18 laboratory and imaging orders and 4 medications [13, 42].
Pharmacologic interventions were selected based on evidence of widespread use for the management of delirium, not by efficacy or other clinical measures [13]. Clinical impressions were extracted from the presence of eight words or phrases with high PPV for delirium in EHR notes [25]. Additional file 2: Table A.2 lists the 31 included clinical actions. No steps were taken to identify or impute missing values. Occurrences of clinician actions were formed into an event count matrix across each admission [43]. A more detailed description of data pre-processing, with code, is available in Additional file 1: File B.

We compared performance of five binary ML classifiers [16, 17, 19, 22], including logistic regression (stats R-package), Classification and Regression Trees (CART; rpart R-package) [44, 45], supervised random forests (randomForest) [46, 47], naïve Bayes (e1071) [48, 49], and support vector machines (SVM; e1071) [49, 50] (Additional file 1: File A.1). The logistic regression model underwent refinement and feature selection by stepwise forwards and backwards selection, L1/LASSO (Least Absolute Shrinkage and Selection Operator) penalization [51, 52], L2/Ridge penalization [53], and combined L1-L2 penalization (penalized) [54]. Model performance on the training set was compared by ROC visualization and AUC (pROC) [55] (Additional file 1: File A.2). The top performing model was selected by maximum AUC. Model development is reported here in accordance with Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) guidelines [56].

Logistic regression generates a model with a log-odds threshold set at zero to divide hospitalizations with incident delirium from those without. This "natural" or "default" cut-point reflects the prior probability of delirium within the cohort, and is therefore susceptible to error from outdated prior information (such as known misclassification).
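The default cut-point described above can be made concrete with a minimal Python sketch (the study itself used R). It assumes a logistic model on event counts in which each coefficient is the log of an odds ratio; the two odds ratios are the ones reported for antipsychotics and toxicology screens, while the intercept and the counts are hypothetical placeholders, not fitted values.

```python
import math

def log_odds(intercept, coefs, counts):
    """Linear predictor of a logistic model on event counts."""
    return intercept + sum(b * x for b, x in zip(coefs, counts))

def probability(z):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, cut_point=0.0):
    """Positive label when the log-odds exceed the cut-point.

    cut_point=0.0 is the default threshold (probability 0.5).
    """
    return z > cut_point

# Hypothetical intercept and counts, for illustration only.
intercept = -3.0
coefs = [math.log(1.44), math.log(1.28)]  # log odds ratios per event
counts = [2, 1]                           # two antipsychotic orders, one tox screen

z = log_odds(intercept, coefs, counts)
default_label = classify(z)                  # uses the log-odds = 0 cut-point
tuned_label = classify(z, cut_point=-2.5)    # a lowered, more sensitive cut-point
```

With these placeholder values the admission falls below the default cut-point but above the lowered one, which is the kind of reclassification that cut-point tuning produces.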
As commonly implemented in diagnostic test development, we tuned the cut-point of our binary classifier to calibrate sensitivity and specificity to correct for known misclassification [34], a technique already used in supervised delirium model development [16]. Because we suspect ICD-9 code missingness [30-32], we desire a model with high sensitivity. In the case of known misclassification, we believe that some of the additional positives generated by increased sensitivity represent true, but unlabeled, positives that have been reclassified. These reclassified positives represent hospitalizations containing real incident delirium, but lacking ICD-9 codes due to a priori outcome misclassification from known ICD-9 code missingness [30-32]. Thus, reclassification by up-tuning sensitivity allows us to generate a model that better labels the presence of true delirium.

On training data, we compared six algorithmic methods for reclassification of a binary model by tuning sensitivity: the Youden index [57], maximizing both sensitivity and specificity, maximizing accuracy, minimizing the distance to ROC (0,1), maximizing accuracy given a minimum constraint of sensitivity, and maximizing sensitivity given a minimal specificity constraint (Additional file 1: A.3; cutpointr R-package) [58]. We determined the threshold of choice based on concordance between measures, choosing a cut-point that represented trends between tuning methods. We also visualized reclassification by each cut-point by density plot. The final model was trained on training data using the binary classifier with the highest AUC and the cut-point with the highest measured concordance. This best-performing model was run on retained test data. Validation was performed on test data only.
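Two of the tuning criteria named above can be sketched in a few lines of Python. This is an illustrative reimplementation under simple assumptions (higher score means more likely positive; candidate cut-points are the observed scores), not the cutpointr code used in the study, and the scores and labels are toy data.

```python
def sens_spec(scores, labels, cut):
    """Sensitivity and specificity when score >= cut is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cut)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cut)
    return tp / (tp + fn), tn / (tn + fp)

def youden_cut_point(scores, labels):
    """Cut-point maximizing Youden's J = sensitivity + specificity - 1."""
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(scores)):
        se, sp = sens_spec(scores, labels, cut)
        j = se + sp - 1
        if j > best_j:  # ties keep the lowest cut-point
            best_cut, best_j = cut, j
    return best_cut, best_j

def max_sens_given_spec(scores, labels, min_spec):
    """Cut-point with the highest sensitivity among those meeting min_spec."""
    best_cut, best_se = None, float("-inf")
    for cut in sorted(set(scores)):
        se, sp = sens_spec(scores, labels, cut)
        if sp >= min_spec and se > best_se:
            best_cut, best_se = cut, se
    return best_cut, best_se

# Toy predicted probabilities and reference labels (1 = coded delirium).
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]

youden_cut, youden_j = youden_cut_point(scores, labels)
constrained_cut, constrained_se = max_sens_given_spec(scores, labels, 0.8)
```

On this toy data both criteria select the same cut-point, which mirrors the idea of choosing a threshold by concordance between tuning methods.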
We identified two related models in the literature proposed from chart review to predict incidence of delirium within a hospital stay from clinician actions and implemented them at an expanded scale. To assess Puelle's classifier [25], we trained a logistic regression model on the training set with eight binary predictors, each indicating the presence or absence at any point in a hospitalization of one of the eight words in notes with high PPV for delirium (Additional file 1: Material A.4.1). Previously, we had implemented the same eight words in our model of 31 clinician actions (Additional file 2: A.2). We omitted Puelle's final criterion, "'alert and oriented' (< 3)," due to difficulty of abstracting this data point from free-text note fields without natural language processing. The resultant model was validated on the test set. The binary threshold was chosen with the Youden Index. We compared our novel model to Puelle's classifier by the Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) [59].

We tested Kim's classifier [24] by labeling hospitalizations as delirium-positive if they contained a delirium ICD-9 code or if antipsychotics were prescribed at any point during hospitalization (Additional file 1: Material A.4.2). Admissions were delirium-negative if a delirium ICD-9 code was not applied and antipsychotics were not administered. This simple recategorization did not require training and was applied directly to the test set.

Statistical measures of final model performance included sensitivity, specificity, PPV, negative predictive value (NPV), AUC (for supervised models), and comparison against expected prevalence of ICU delirium. Reclassification was validated on five clinically meaningful demographic and outcome measures: age at admission [3], discharge location [5-7], death in hospital, death within 30 days of admission [38], and one-year mortality from admission [10].
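The rule-based recategorization and the four performance measures can be sketched as follows. This is a hedged Python illustration, not the study's R code; the function names and the six toy admissions are hypothetical. It also makes visible that a rule containing "ICD-9 code present" cannot miss any ICD-coded admission when ICD-9 codes are the reference label.

```python
def kim_label(has_delirium_icd9, antipsychotic_given):
    """Positive if a delirium ICD-9 code or an antipsychotic is present."""
    return has_delirium_icd9 or antipsychotic_given

def confusion_metrics(predicted, reference):
    """Sensitivity, specificity, PPV and NPV from paired boolean labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy admissions: (delirium ICD-9 code present, antipsychotic given).
admissions = [
    (True, False), (True, True), (False, True),
    (False, False), (False, False), (False, True),
]

reference = [icd for icd, _ in admissions]                # ICD-9 code as label
predicted = [kim_label(icd, drug) for icd, drug in admissions]
metrics = confusion_metrics(predicted, reference)
```

Because every ICD-positive admission is positive under the rule, sensitivity and NPV equal 1 by construction on any data set evaluated against ICD-9 codes.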
To assess success and meaningfulness of re-classification and goodness-of-fit for each model, we separated admissions into four groups (Table 1). First, we compared ICD-Positives and Double-Negatives. If these were significantly different, we report tests comparing ICD-Positives to Reclassified-Positives, Double-Negatives to Reclassified-Negatives, and Reclassified-Positives to Reclassified-Negatives. Similarity or difference between groups was assessed using Tukey multiple comparisons

Table 1 Definitions of four classified and re-classified categories generated by a binary classifier. For any binary classifier with less than 100% accuracy, model testing results in some degree of reclassification of positives and/or negatives, generating four groups. For example, some admissions with an ICD-9 code for delirium are labeled as negative by the model, leading to re-classification.

From 48,451 unique adult admissions in MIMIC-III with LOS ≤ 31 days, we identified 3,850 patients with delirium by ICD-9 codes (7.9%). Demographic characteristics and pertinent outcomes of the cohort are described in Table 2. Briefly, patients with delirium differed significantly from patients without delirium in race/ethnicity, age at admission, and length of stay.

actions. Because three of four feature selection methods recommended inclusion of all 31 features, and because of the potential for knowledge loss with predictor elimination, the model with 31 clinical actions was selected. Table 3 presents 17 highly significant predictors (p < 0.001) from the final, multiple logistic regression model of 31 clinical actions. The full model can be found in Additional file 2: Table A.3. Among clinical impressions captured from single words in text notes, odds of delirium were higher with each note mentioning "mental status" (OR = 1.14), "deliri*" (OR = 1.12), "hallucin*" (OR = 1.25), "confus*" (OR = 1.16), or "disorient*" (OR = 1.10).
Odds of delirium were lower for each note mentioning "reorient*" (OR = 0.86). Among laboratory tests, odds of delirium were significantly greater with clinical orders for urine culture (OR = 1.13), thyroid function test (OR = 1.12), serum B12 or folate (OR = 1.45), and blood or urine toxicology screen (OR = 1.28). Prescription orders for antipsychotics (OR = 1.44), benzodiazepines (OR = 1.08), and dexmedetomidine (OR = 1.43) were associated with higher odds of delirium. We compared six metrics for sensitivity (Se) tuning: the Youden Index (Se = 80%), maximizing sensitivity and specificity (Se = 80%), maximizing accuracy

ML holds the potential to unlock improved diagnosis, risk stratification, and treatment of delirium in the ICU, a complex syndrome associated with serious morbidity and mortality. Before ML can be used to make clinically actionable predictions, informaticians developing models for delirium incidence, prognosis, and treatment need tools to accurately label patients with delirium in large datasets, despite serious flaws with current labeling methods. Ideally, delirium researchers need a valid, efficient, computational tool that is independent of the clinical variables of interest to label patients with delirium in large datasets without the need for chart review or in-person clinical assessments. A high-accuracy, computationally generated label could be used for training future models on pressing clinical questions, including identifying timing of delirium onset in the hospital course or classifying patients with delirium into clinically relevant clusters. Here, we proposed to label delirium from clinician actions, using placement of orders associated with standard workup of delirium as a surrogate for clinicians recognizing delirium in real time.
After comparison of five supervised ML methods and four methods of feature selection, we proposed a novel, multiple logistic regression model to label ICU delirium from counts of 31 clinician actions abstracted from clinical guidelines, with high AUC (0.83). If predictors are not independent, we expect improved performance from non-linear models. However, because these 31 clinical actions are regularly employed in wider clinical practice independent of delirium, and thus none is specific for delirium, it is possible that a greater than expected independence between covariates resulted in unexpectedly good performance from the logistic model. The assumption of independence is reinforced by a correlation matrix with less than 4% of 31 predictors having a Spearman's ρ of ≥ 0.6. The logistic model is both appropriate to the data and offers clearer biological interpretability than many non-linear models. Model performance on a training set was validated on a randomly selected test set. The model was concordant with clinical intuition, with odds of delirium higher with words such as "deliri*," "hallucin*," and "disorient*," but odds of delirium lower with "reorient*." Marked elevations in odds of delirium were associated with toxicology screening, used to detect delirium from substance intoxication or withdrawal, and prescription of antipsychotics or dexmedetomidine. Evidence of intoxication falls within the DSM-5 criteria for diagnosis of delirium [1, 12]. Guidelines recommend antipsychotics as the drug class of choice for symptomatic treatment of delirium [13]. Dexmedetomidine is recommended as a preferred drug for management of delirium in mechanically ventilated patients [13]. We compared our labeling model to two similar models previously proposed in the literature to abstract delirium incidence from chart review.
Both our model and Puelle's classifier produced sensitivity and specificity between 71 and 80%, indicating good fidelity to delirium ICD-9 codes with modest reclassification of both positives and negatives. Although the implementation of Puelle's classifier has similar PPV and sensitivity with fewer predictors, Kim et al. [24] reported low sensitivity (30%) but high specificity (97%) of their classifier on a prospective study of 184 adults. Specificity on the expanded MIMIC-III data set was 85.7%. Our implementation of Kim's classifier never generates reclassified negatives: all patients with ICD-9 codes for delirium are classified in the delirium group by definition. Thus, the 100% sensitivity and 100% NPV reflect definitions for model creation, not quality of fit. The PPV of Kim's classifier (37.7%) surpasses that of Puelle's classifier (19.8%) and our model (19.7%). However, PPV is also defined by simple re-categorization in Kim's classifier, and is not indicative of improved performance. For both Kim's and Puelle's classifiers, reduced performance with computational application on the expanded MIMIC-III dataset suggests limitations in generalizability and validation of these small-scale proposals.

Because ground-truth is not reasonably attainable in these data by chart review due to their very large size, we compared goodness-of-fit of the three models by biological validation [26]. First, we assume that, for a good model, predicted prevalence of delirium (sum of ICD-Positives and Reclassified-Positives) should approach known ICU delirium prevalence from the literature. In a meta-analysis of 48 studies on ICU delirium, Krewulak et al. [2] obtained an overall pooled delirium prevalence of 31%. Kim's classifier predicted delirium prevalence above ICD-9 code frequency (21.1%). Our model (32.5%) and Puelle's classifier (31.9%) predicted delirium prevalence concordant with Krewulak's pooled figures, indicating an appropriate quantity of reclassified patients.
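The relationship between labeled prevalence, sensitivity, specificity, predicted prevalence, and PPV can be checked with a short Python sketch. Two assumptions are made here and are not taken from the source: the four group definitions are inferred from the text (the body of Table 1 is not reproduced), and the ICD-9 prevalence in the test set is assumed equal to the 7.9% reported for the full cohort, which is plausible given that positives and negatives were sampled in the same proportion.

```python
def reclassification_group(icd_positive, model_positive):
    """Assign one of four groups from the ICD-9 label and the model label.

    Group definitions are inferred from the text, not copied from Table 1.
    """
    if icd_positive and model_positive:
        return "Double-Positive"
    if model_positive:
        return "Reclassified-Positive"
    if icd_positive:
        return "Reclassified-Negative"
    return "Double-Negative"

def predicted_prevalence(prevalence, sensitivity, specificity):
    """Share of admissions the classifier labels positive."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos + false_pos

def ppv(prevalence, sensitivity, specificity):
    """Share of classifier positives that carry the reference label."""
    return prevalence * sensitivity / predicted_prevalence(
        prevalence, sensitivity, specificity
    )

# Figures quoted in the text for the rule-based classifier; the 7.9%
# prevalence is the full-cohort value, assumed to hold in the test set.
kim_prevalence = predicted_prevalence(0.079, 1.0, 0.857)
kim_ppv = ppv(0.079, 1.0, 0.857)
```

Under these assumptions the sketch gives a predicted prevalence of about 0.21 and a PPV of about 0.375, close to the reported 21.1% and 37.7%; the small gap is consistent with rounding and with test-set prevalence differing slightly from the cohort value.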
We further biologically validate against clinically meaningful outcome measures. We compared classification and reclassification groups by age, discharge location, short-term risk of death, and one-year mortality. Our method of model validation rests on the principle that application of any binary classifier that does not have perfect (100%) sensitivity and specificity reclassifies subjects, such that some number of subjects receive a classification from the model that differs from their input label assignment (Table 1, Fig. 2). If the binary classification model is valid, then this unavoidable reclassification should result in reclassified subjects resembling their reclassified assignment more than their label assignment across the five comparison measures. On the basis of biological validation, our novel model markedly outperformed Kim's and Puelle's classifiers, correctly capturing significant differences between Double-Positives and Double-Negatives and between Reclassified-Positives and Reclassified-Negatives on all five measures. Delirium is a heterogeneous syndrome with subtype variation, including an underdiagnosed hypoactive subtype and a subclinical form [5, 12]. Thus, differences between Double-Positives and Reclassified-Positives may represent variability in clinician practice between delirium subtypes, with our model reclassifying patients belonging to subtypes underrepresented in previous studies.

The clinical utility of our novel model rests on important contextual factors. First, our study is based on publicly available data from one institution. However, our model uses one of the largest numbers of observations used to date to develop an ML model for delirium.
Although we propose the implementation of a generalizable labeling model that is less labor intensive than models that depend upon screening tools, ICD codes, and chart review (many of which are not easily available), we recognize the importance of heterogeneity that will exist at both an institutional and a local provider level [62]. Examples include sub-group and temporal considerations and idiosyncratic coding and documentation practices. There is a need for local validation and recalibration to ensure the optimal performance of our labeling method [63]. Because of under-identification of hypoactive or milder delirium in the clinical [5] or analytic [24] setting, deviations in model goodness of fit may reflect variation in clinical practice and patient presentation between delirium subtypes. As noted previously, our model's overall performance, although better than that of the counterpart models, is still constrained in terms of sensitivity and PPV. As with other ML models, decisions to implement our model will require consideration of tradeoffs among model performance factors, the costs of model implementation, and the implications of false positives [64, 65]. The potential response to positive cases and other approaches that can be used to establish true-positive cases will be critical. Finally, because this model does not use time-dependent variables, it may not be able to label a patient with delirium until after all encounter data are available. Future work to predict delirium subtypes from the medical record is warranted. Patients presenting with other diseases, for example SARS-CoV-2, may introduce other features that could improve the calibration of the model given the prevalence of such a disease in the local ICU. ICU delirium has been shown to be comorbid with SARS-CoV-2, arising from disorientation and social isolation, use of mechanical ventilation, and an aging patient population [66].
We developed a novel labeling model for delirium in the ICU using a large data set from a publicly available database. This database has been previously used to develop ML models for other applications [67, 68] . Our model incorporates 31 clinical actions as features, an approach that has been previously overlooked in other delirium prediction models. We assessed the performance of our labeling model based on other delirium prediction models and biological markers of significance. Our model demonstrates relative superiority based on the assessment rubric; however, more validation and recalibration are needed to consider important contextual factors that may arise before and during the use of the model in a local ICU. These results provide a tool to aid future researchers developing ML classifiers for ICU patients with delirium. Association AP: Diagnostic and statistical manual of mental disorders (DSM-5 ® ) Incidence and prevalence of delirium subtypes in an adult ICU: a systematic review and meta-analysis Delirium in elderly people Delirium as a predictor of mortality in mechanically ventilated patients in the intensive care unit Delirium in hospitalized older adults the clinic. Delirium. 
Annals Internal Med
Effect of delirium and other major complications on outcomes after elective surgery in older adults
Cognitive trajectories after postoperative delirium
Long-term cognitive impairment after critical illness
Delirium in elderly patients and the risk of postdischarge mortality, institutionalization, and dementia: a meta-analysis
Delirium in elderly adults: diagnosis, prevention and treatment
Delirium diagnosis, screening and management
Clinical practice guidelines for the management of pain, agitation, and delirium in adult patients in the intensive care unit
Latent class analysis of the multivariate Delirium Index in long-term care settings
Identification of sub-groups in acutely ill elderly patients with delirium: a cluster analysis
Development and validation of an electronic health record-based machine learning model to estimate delirium risk in newly hospitalized patients without known cognitive impairment
Prediction and early detection of delirium in the intensive care unit by using heart rate variability and machine learning
Prediction of incident delirium using a random forest classifier
Exploiting machine learning algorithms and methods for the prediction of agitated delirium after cardiac surgery: models development and validation study
Performance of electronic prediction rules for prevalent delirium at hospital admission
Validation of a delirium risk assessment using electronic medical record information
Delirium prediction using machine learning models on preoperative electronic health records data
Delirium in mechanically ventilated patients: validity and reliability of the confusion assessment method for the intensive care unit (CAM-ICU)
Evaluation of algorithms to identify delirium in administrative claims and drug utilization database
The language of delirium: keywords for identifying delirium from medical records
Unsupervised machine learning and prognostic factors of survival in chronic lymphocytic leukemia
On the representation of machine learning results for delirium prediction in a hospital information system in routine care
Use of a validated delirium assessment tool improves the ability of physicians to identify delirium in medical intensive care unit patients
Intensive care delirium screening checklist: evaluation of a new screening tool. Intensive Care Med
Measuring diagnoses: ICD code accuracy
Strategies for handling missing data in electronic health record derived data. EGEMS (Washington, DC)
Accuracy and completeness of clinical coding using ICD-10 for ambulatory visits
Modern epidemiology
Optimum binary cut-off threshold of a diagnostic test: comparison of different methods using Monte Carlo technique
Misclassification of outcome in case-control studies: Methods for sensitivity analysis
Learning from positive and unlabeled data: A survey
Development of a clinical prediction model for an ordinal outcome: the World Health Organization Multicentre Study of Clinical Signs and Etiological agents of Pneumonia, Sepsis and Meningitis in Young Infants. WHO/ARI Young Infant Multicentre Study Group
MIMIC-III, a freely accessible critical care database
The MIMIC Code Repository: enabling reproducibility in critical care research
PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals
Alistair EW: The MIMIC-III Clinical Database
Harrison's principles of internal medicine: McGraw-Hill Education
Assessing clinical heterogeneity in sepsis through treatment patterns and machine learning
Classification and regression trees
An introduction to recursive partitioning using the RPART routines
Random forests
Classification and regression by randomForest. R news
Automatic indexing: an experimental inquiry
The e1071 package. Misc Functions of Department of Statistics
Support-vector networks
Linear inversion of band-limited reflection seismograms
Regression shrinkage and selection via the lasso
Ridge regression: Biased estimation for nonorthogonal problems
L1 and L2 penalized regression models. Vignette R Package Penalized
pROC: an open-source package for R and S+ to analyze and compare ROC curves
Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement
Index for rating diagnostic tests
The cutpointr package: Improved and tidy estimation of optimal cutpoints
Sensitivity and specificity of information criteria. Brief Bioinform
Fisher's exact approach for post hoc analysis of a chi-squared test
A simple sequentially rejective multiple test procedure
Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data
Designing risk prediction models for ambulatory no-shows across different specialties and clinics
Individualizing risk prediction for positive COVID-19 testing: results from 11,672 patients
Identifying surgical site infections in electronic health data using predictive models
COVID-19: ICU delirium management during SARS-CoV-2 pandemic
An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes
Multitask learning and benchmarking with clinical time series data

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors do not have any personal acknowledgements.

Authors' contributions
CEC contributed in study design, data analysis, and manuscript preparation. KRC contributed in data analysis and manuscript preparation. NF contributed in study design, data analysis, and manuscript preparation. All authors read and approved the final manuscript.

The authors do not have any outside funding for this project to acknowledge.

The dataset supporting the conclusions of this article is available in the MIMIC-III repository [38, 39, 41], https://mimic.physionet.org/.

The online version contains supplementary material available at https://doi.org/10.1186/s12911-021-01461-6.

Additional file 2. Study R Markdown file with data pre-processing and variable selection.