authors: Cummings, B. C.; Ansari, S.; Motyka, J. R.; Wang, G.; Medlin, R. P.; Kronick, S. L.; Singh, K.; Park, P. K.; Napolitano, L. M.; Dickson, R. P.; Mathis, M. R.; Sjoding, M. W.; Admon, A. J.; Ward, K. R.; Gillies, C. E.
title: Validation and comparison of PICTURE analytic and Epic Deterioration Index for COVID-19
date: 2020-07-10
DOI: 10.1101/2020.07.08.20145078

Abstract

Introduction: The 2019 coronavirus (COVID-19) has led to unprecedented strain on healthcare facilities across the United States. Accurately identifying patients at increased risk of deterioration may help hospitals manage their resources while improving the quality of patient care. Here we present the results of an analytical model, PICTURE (Predicting Intensive Care Transfers and other UnfoReseen Events), that identifies patients at high risk of imminent intensive care unit (ICU) transfer, respiratory failure, or death, with the intention of improving the prediction of deterioration due to COVID-19. We compare PICTURE to the Epic Deterioration Index (EDI), a widespread system that has recently been assessed for use in triaging COVID-19 patients.

Methods: The PICTURE model was trained and validated on a cohort of hospitalized non-COVID-19 patients using electronic health record data from 2014-2018. It was then applied to two hold-out test sets: non-COVID-19 patients from 2019 and patients testing positive for COVID-19 in 2020. PICTURE results were aligned to the EDI for head-to-head comparison via the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). We compared the models' ability to predict an adverse event (defined as ICU transfer, mechanical ventilation use, or death) at two levels of granularity: (1) the maximum score across an encounter with a minimum lead time before the first adverse event, and (2) predictions at every observation, with instances in the last 24 hours before the adverse event labeled as positive. PICTURE and the EDI were also compared at the encounter level using different lead times extending out to 24 hours. Shapley values were used to provide explanations for PICTURE predictions.

Results: PICTURE successfully distinguished between high- and low-risk patients and consistently outperformed the EDI in both of our cohorts. In non-COVID-19 patients, PICTURE achieved an AUROC (95% CI) of 0.819 (0.805-0.834) and an AUPRC of 0.109 (0.089-0.125) at the observation level, compared to the EDI's AUROC of 0.762 (0.746-0.780) and AUPRC of 0.077 (0.062-0.090). In COVID-19 positive patients, PICTURE achieved an AUROC of 0.828 (0.794-0.869) and an AUPRC of 0.160 (0.089-0.199), while the EDI scored an AUROC of 0.792 (0.754-0.835) and an AUPRC of 0.131 (0.092-0.159). The most important variables influencing PICTURE predictions in the COVID-19 cohort were a rapid respiratory rate, a high level of oxygen support, low oxygen saturation, and impaired mental status (Glasgow Coma Score).

Conclusion: In our cohorts, the PICTURE model was more accurate than the EDI in predicting adverse patient outcomes for both general ward patients and COVID-19 positive patients. The ability to consistently anticipate these events may be especially valuable when considering a potential incipient second wave of COVID-19 infections. PICTURE can also explain individual predictions to clinicians by ranking the most important features for a prediction.
The generalizability of the model will require testing in other healthcare systems for validation.

Introduction

The effect of the 2019 coronavirus (COVID-19) on the US healthcare system cannot be overstated. It has led to unprecedented clinical strain in hospitals across the nation, prompting the expansion of ICU capacity and the construction of lower-acuity field hospitals to accommodate the increased patient load. A predictive early warning system capable of identifying patients at increased risk of deterioration could assist hospitals in maintaining a high level of patient care while distributing their thinly stretched resources more efficiently. However, a recent review has shown that high-quality, validated models of deterioration in COVID-19 patients are lacking [1]. All 16 of the models appraised in that review were rated at high or unclear risk of bias, mostly because of non-representative selection of control patients. A primary concern is that these models may overfit to the small COVID-19 datasets currently available.

Early warning systems were applied in hospital settings before the COVID-19 pandemic, and continue to be used, to predict patient deterioration events before they occur, giving healthcare providers time to intervene [2]. The prediction of adverse events such as ICU admission and death provides crucial information for averting impending critical deterioration: an estimated 85% of such events are preceded by detectable changes in physiological signs [3] that may occur up to 48 hours before the event [4]. In addition, approximately 44% of events are avoidable through early intervention [5], and 90% of unplanned transfers to the intensive care unit (ICU) are preceded by a new or worsening condition [6], [7]. Such abnormal signals indicate that predictive data analytics may be used to alert providers to incipient deterioration events, ultimately leading to improved care and reduced costs [8], [9]. Given the number of unknowns surrounding the pathophysiology of COVID-19, early warning systems may play a pivotal role in treating patients and improving outcomes.

One model that has been assessed in COVID-19 patients is the Epic Deterioration Index (EDI) [10], [11]. The EDI has an advantage over models built on COVID-19-specific data in that it is not overfit to small datasets, as it was trained on over 130,000 encounters [11], [12]. Recent work has suggested it may be capable of stratifying COVID-19 patients according to their risk of deterioration [11]. The outcomes used in that study were those considered most relevant to the care of COVID-19 patients, including ICU level of care, mechanical ventilation, and death. While the EDI was able to successfully isolate groups of patients at very high and very low risk of deterioration, its overall performance as a continuous predictor was moderately low (AUROC 0.76 (95% CI 0.68-0.84), n = 174) [11]. Additionally, much of the detail surrounding the EDI's structure and internal validation has not been shared publicly, which makes the interpretation of individual predictions difficult. In the current report, we apply our previously described model, PICTURE (Predicting Intensive Care Transfers and other UnfoReseen Events), to a cohort of patients testing positive for COVID-19 [13].
PICTURE was initially developed to predict patient deterioration in the general wards; we have re-trained the model to target the outcomes considered most relevant to the COVID-19 pandemic: ICU level of care, mechanical ventilation, and death. PICTURE, like the EDI, was trained and tuned on a large non-COVID-19 cohort (128,732 encounters). Furthermore, we took extensive steps in the PICTURE framework to limit overfitting and to prevent the model from learning missingness patterns in the data. This is critical for providing clinicians with novel, useful, and generalizable alerts, as missingness patterns can vary across settings and patient phenotypes [13]. In addition to the risk score, PICTURE provides actionable explanations for its predictions in the form of Shapley values, which may help clinicians interpret scores and determine whether action on the alert is required [14]. We validate this system in both a non-COVID-19 cohort and a cohort of patients testing positive for COVID-19, and compare it to the EDI on the same matched cohorts.

Methods

The study protocol was approved by the University of Michigan's Institutional Review Board (HUM00092309). EHR data were collected from a large tertiary academic medical system (Michigan Medicine) from January 1, 2014 to June 1, 2020. The first five years of data (2014-2018, n = 128,732 encounters) were used to train and validate the model, while 2019 data were reserved as a hold-out test set (n = 32,754 encounters). Training, validation, and test populations were segmented to prevent overlap of multiple hospital encounters between sets. Patients were eligible for inclusion in these three cohorts if they were 18 to 89 years of age and hospitalized (with inpatient or other observation status) in a general ward. We excluded patients with a left ventricular assist device (LVAD), patients who were discharged to hospice, and patients whose ICU transfer originated from a floor other than a general ward (e.g., an operating or interventional radiology unit), in order to exclude planned ICU transfers. To be included in the COVID-19 cohort (n = 430), patients must have received a positive COVID-19 test result from Michigan Medicine within a span of 14 days before or 21 days after their hospitalization. These patients were then filtered using the same criteria used in the 2019 test set, with the exception of the hospice and LVAD exclusions. Only discharged patients or those who had already experienced an adverse event were included. Table 1 describes the study cohort and the frequency of individual adverse events. Compared to the non-COVID-19 test cohort from 2019, the median age of COVID-19 patients was slightly higher (64.8 vs. 61.3) and the proportion of Black and Asian patients was considerably higher. The rate of adverse events was also higher, rising from slightly under 4% to 30%.

The variables used as predictors were collected from the electronic health record (EHR) and broadly included vital signs and physiologic observations, laboratory and metabolic values, and demographics. Specific features were selected based on previous analysis [13]. Vital signs used in the model included heart rate, respiratory rate, pulse oximetry, Glasgow Coma Score (GCS), urine output, and blood pressure. Laboratory and metabolic features included electrolyte concentrations, glucose and lactate, and blood cell counts. Demographics included age, height, weight, race, and gender. Fluid bolus and oxygen supplementation were also included as features. A full list of features is presented in Table S.1 in the supplemental material, alongside their respective means, standard deviations, and missingness rates.

The primary outcomes in the training, validation, and test cohorts (data collected from 2014 through 2019) were death, cardiac arrest (as defined by the American Heart Association's Get With The Guidelines®), transfer to an ICU from a general ward or similar unit, or need for mechanical ventilation. Determination of ICU transfer was based on actual location or accommodation level. Outcomes in the COVID-19 positive cohort differed in two respects. First, cardiac arrest information was not available at the time of writing and so was not included. Second, the emergency procedures undertaken by the hospital to accommodate the high volume of COVID-19 patients led to the delivery of critical care in non-ICU settings. Thus, ICU level of care was used to determine ICU transfer rather than actual location; in other words, ICU-level care provided in a non-ICU setting because of capacity constraints was counted as an ICU transfer. Observations occurring thirty minutes before the first event or later were discarded, consistent with other approaches [15]. For observation-level predictions, individual observations were labeled positive if they occurred within 24 hours of any of the above events, and negative otherwise. We refer to these composite adverse events as the "outcome" or "target" throughout the text. These outcomes were designed to closely follow those of a recent analysis of the EDI at Michigan Medicine [11].
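As an illustration, a minimal sketch of this observation-level labeling in pandas, assuming a DataFrame obs with one row per observation and a per-encounter lookup of the first adverse event time (all names hypothetical; the paper does not describe its implementation):

```python
import pandas as pd

# Hypothetical inputs: obs has columns "encounter_id" and "obs_time";
# first_event_time maps each encounter to its first adverse event (NaT if none).
obs["event_time"] = obs["encounter_id"].map(first_event_time)
delta = obs["event_time"] - obs["obs_time"]

# Discard observations from 30 minutes before the first event onward.
obs = obs[~(delta <= pd.Timedelta(minutes=30))]

# Label positive if the observation falls within 24 hours of the event;
# comparisons against NaT evaluate to False, so event-free rows stay negative.
delta = obs["event_time"] - obs["obs_time"]
obs["target"] = (delta <= pd.Timedelta(hours=24)).astype(int)
```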
Table 1: Study cohort characteristics across the training, validation, and 2019 test cohorts. Age is presented as the median with interquartile range (IQR). In our population, the three most frequently encountered races were White, Black, and Asian; other races, each comprising less than 1% of the population, were grouped under the "Other" heading. Sex is reported as the number and percent of females. The event rate represents a composite outcome indicating that one of the following events occurred: death, ICU transfer, mechanical ventilation, or cardiac arrest. The individual frequencies of these adverse events are also reported, and represent the number of cases in which each particular outcome was the first to occur. Cardiac arrest was not used as a target in the COVID-19 positive population, as the manually adjudicated data are not yet available. Please see Section 2.3 "Outcomes" for the procedure used to calculate these targets.

To train and evaluate the PICTURE model, we partitioned our data into four sets: training and validation sets using data from 2014-2018, a test set using 2019 data, and a fourth set consisting of data from COVID-19 positive patients. The sets were partitioned such that multiple hospital encounters from the same individual were restricted to one cohort, preventing patient-level overlap between cohorts. Encounters with an admission date from January 1, 2014 to December 31, 2018 were used for training and validation/hyperparameter tuning (n = 128,732 encounters); these patients were further divided between training and validation sets using an 80/20% split. Patients with an admission date between January 1, 2019 and December 31, 2019 were reserved as a hold-out test set (n = 32,754 encounters). Lastly, patients testing positive for COVID-19 from March 1, 2020 to June 1, 2020 were reserved as a separate set (n = 430 encounters).
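A minimal sketch of a patient-level split of this kind, using scikit-learn's GroupShuffleSplit (our choice of tool for illustration; the paper does not specify the implementation), with X, y, and patient_ids as hypothetical encounter-level arrays:

```python
from sklearn.model_selection import GroupShuffleSplit

# 80/20 split of 2014-2018 encounters into training and validation sets,
# keeping all encounters from the same patient within a single set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=patient_ids))
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
```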
The training and validation sets were grouped into 8-hour windows to ensure that each encounter would have the same number of observations for the same amount of time in the hospital, preventing the model from placing undue emphasis on patients with more frequent measurements during training. The 2019 and COVID-19 test sets were left in a granular format, in which each new observation represented the arrival of new data (e.g., an updated vital sign). Vital signs and laboratory values were forward-filled so that each observation represented the most up-to-date information available at that time, and the remaining missing values were iteratively imputed using the mean of the posterior distribution from a multivariate Bayesian regression model. This method has previously been demonstrated to reduce the degree to which tree-based models learn missingness patterns to bolster performance [13]. Classification was performed using an XGBoost model (v. 0.90) with a logistic objective function and a maximum tree depth of three; training was stopped when the validation AUPRC had not improved for 30 rounds [16]. All analysis was performed using Python 3.8.2.
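A minimal sketch of this training configuration using the xgboost scikit-learn wrapper (fit-time eval_metric and early_stopping_rounds arguments, as in the 0.90-era API); the imputed matrices X_train/X_val and labels y_train/y_val are assumed from the steps above:

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    objective="binary:logistic",  # logistic objective described above
    max_depth=3,                  # maximum tree depth of three
    n_estimators=1000,            # upper bound; early stopping picks the round
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="aucpr",          # area under the precision-recall curve
    early_stopping_rounds=30,     # stop after 30 rounds without improvement
)
```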
The EDI is a proprietary model developed by Epic Systems Corporation (Verona, WI). Michigan Medicine uses Epic as its electronic medical record system and has access to the EDI tool. Similar to PICTURE, the EDI uses clinical data commonly available in the EHR to make predictions regarding patient deterioration. It was trained on a similar composite target, with death, rapid response team (RRT) calls, ICU transfer, and resuscitation as adverse events [11]. It is calculated every 15 minutes. Specific details surrounding its structure, parameters, and training procedures have not been shared publicly.

The performance of the PICTURE model was first assessed on all 32,754 encounters in the hold-out test set comprising patients from 2019. Another aggregate early warning score, the National Early Warning Score (NEWS), was used for comparison in this preliminary analysis [17], [18]. The original NEWS was selected over the updated NEWS2 score due to evidence that its performance is higher when predicting adverse events in patients at risk of respiratory failure [19]. For each observation time point, the NEWS score was calculated according to the published scoring system and compared to PICTURE scores.

Performance was assessed on two scales: observation-level and encounter-level. The term "observation-level" denotes the performance of the model at each time the data for a patient are updated, with observations occurring within 24 hours before a target event marked as 1 and otherwise marked as 0. "Encounter-level" describes the model performance across the entire hospital encounter for one patient, and refers to the maximum model score during the patient's stay occurring at least 30 minutes (or longer, for different minimum lead times) before the first event. The target in this case is one if the patient ever met an outcome condition during their stay, and zero otherwise.

Since the EDI makes a prediction every 15 minutes, we were not able to calculate scores at each time point as with NEWS. We instead simulated how the PICTURE score, calculated at irregular intervals each time a new data point arrives, would align with the EDI scores calculated every 15 minutes. This limited the available data to 21,215 encounters in the 2019 test set and 401 encounters in the COVID-19 cohort. The PICTURE scores were merged onto EDI values by taking the most recent PICTURE prediction before each EDI prediction. This was done to give the EDI every advantage in the alignment procedure. Figure 1 displays a visual schematic of this alignment. The two models were then evaluated using the same observation- and encounter-level methods described in the previous section.

Figure 1: Schematic of the score alignment procedure. While PICTURE generates a score each time new data is input into the system, the EDI score is generated every 15 minutes. To give the EDI any potential advantage, PICTURE scores are aligned to EDI scores by selecting the most recent PICTURE score before each EDI prediction. In both cases, observations occurring 30 minutes before the target and after are excluded (red). For patients who did not experience an adverse event, the maximum score was calculated across the entire encounter.
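A minimal sketch of this alignment using pandas.merge_asof, assuming DataFrames edi and picture that each carry "encounter_id", "timestamp", and score columns (names hypothetical):

```python
import pandas as pd

# For each EDI prediction, attach the most recent PICTURE score issued at or
# before that time within the same encounter.
aligned = pd.merge_asof(
    edi.sort_values("timestamp"),
    picture.sort_values("timestamp"),
    on="timestamp",
    by="encounter_id",
    direction="backward",  # look back to the latest earlier PICTURE score
)
```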
Area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) were used as the primary criteria for comparing the models. AUROC can be interpreted as the probability that two randomly chosen observations (one with a positive target, the other negative) are ranked in the correct order by the model's prediction score. AUPRC describes the average positive predictive value (PPV) across the range of sensitivities. For encounter-level statistics, 95% pivotal confidence intervals were calculated with a bootstrap method using 1,000 replications. For observation-level statistics, block bootstrapping was used to ensure randomization both between encounters and within the observations of an encounter. P-values were calculated using a generalized two-sample, one-tailed t-test.

Despite the many benefits yielded by increasingly advanced machine learning models, the use of these models in medicine has lagged behind other fields. One contributing factor is the complexity of these models, which makes the resulting predictions difficult to interpret and, in turn, makes it difficult to build clinician trust [20]. To provide better insight into PICTURE's predictions, tree-based Shapley values are calculated for each observation. Borrowed from game theory, Shapley values describe the relative contribution of a feature to the model's prediction [14], [21]. Positive values denote features that pushed the model toward a higher prediction score (here indicating a higher likelihood of an adverse event), while negative values indicate features that pushed the model toward a lower prediction score. The sum of the Shapley values for a single prediction, plus the model's mean log-odds, equals the log-odds of the predicted probability. Shapley values can be used to provide insight into individual model predictions or aggregated to visualize global variable importance.

Results

The ability of the PICTURE model to accurately predict the composite target was first assessed using the 32,754 encounters in the hold-out test set from 2019. To provide a baseline for comparison, National Early Warning Scores (NEWS) were calculated alongside each PICTURE prediction output. The observation- and encounter-level AUROC and AUPRC are presented with 95% confidence intervals in Table 2.

PICTURE was then compared to the EDI model on non-COVID-19 patients in the same hold-out test set from 2019. Due to limitations in available EDI scores, the number of encounters was restricted to 21,215. These time-matched scores were again evaluated using AUROC and AUPRC at the observation and encounter levels (Table 3).

In addition to classification performance, lead time represents another critical component of a predictive analytic's utility, as it determines how much time clinicians have to act on the model's recommendations. We assessed the models' relative performance at different lead times in a threshold-independent manner by censoring data occurring within 0.5, 1, 2, 6, 12, and 24 hours before an adverse event (Table 4). In our cohort, PICTURE performed markedly better than the EDI even when considering predictions made 24 hours or more before the actual event.

Table 4: Lead time analysis in the non-COVID-19 cohort. The performance of the two models (encounter-level) at various lead times was assessed by evaluating the maximum prediction score prior to x hours before the given event, with x ranging in progressively greater intervals from 0.5 to 24. In this cohort of non-COVID-19 subjects, PICTURE consistently outperformed the EDI model at each level of censoring.

When applied to patients testing positive for COVID-19, PICTURE performs similarly well. PICTURE scores were again aligned to EDI scores using the process outlined in Section 2.6.2, resulting in the inclusion of 402 encounters. Table 5 presents AUROC and AUPRC values for PICTURE and the EDI at both the observation and encounter level, with 95% confidence intervals. As with the non-COVID-19 cohort, a lead time analysis was then performed to assess the performance of PICTURE and the EDI when making predictions further in advance. Thresholds were again set at 0.5, 1, 2, 6, 12, and 24 hours before the event, and observations occurring after each cutoff were censored. In our cohort, PICTURE again outperformed the EDI even when making predictions 24 hours in advance (Table 6).

Table 6: Lead time analysis in the COVID-19 cohort. The performance of the two models (encounter-level) at various lead times was again assessed by evaluating the maximum prediction score prior to x hours before the given event, with x ranging in progressively greater intervals from 0.5 to 24. In this cohort of COVID-19 subjects, PICTURE consistently outperformed the EDI model at each level of censoring.

To provide clinicians with a description of the factors influencing a given PICTURE score, we used Shapley values computed at each observation.
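A minimal sketch of computing such explanations with the shap package's Tree SHAP implementation [14], assuming the trained XGBoost model from the earlier sketch and an observation-level feature matrix X_obs:

```python
import numpy as np
import shap

# Tree SHAP yields exact Shapley values for tree ensembles such as XGBoost.
# For a logistic model the values are in log-odds units, so one observation's
# values plus the expected value sum to the log-odds of its prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_obs)  # shape: (n_observations, n_features)

# Rank the five features contributing most to a single observation's score,
# e.g., to display alongside an alert.
i = 0  # index of the observation being explained
top5 = np.argsort(-np.abs(shap_values[i]))[:5]
```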
While many of the feature rankings appear similar between the 2019 and COVID-19 cohorts, we noted that respiratory variables such as respiratory rate, oxygen support, and SpO2 played a more pronounced role in predicting adverse events in COVID-19 positive patients than in non-COVID-19 patients. One point of note is that the amount of oxygen support played a significant role in both cohorts. While the EDI does not use the amount of oxygen support as a continuous variable, it does have a feature termed "oxygen requirement" [11]. To demonstrate that the observed improvement of PICTURE over the EDI is not driven solely by this additional information, oxygen support was binarized and the PICTURE model retrained. While performance did decrease, indicating that including oxygen support as a continuous variable is useful for predicting deterioration, PICTURE still outperformed the EDI in both the non-COVID-19 (difference in AUROC: 0.031, AUPRC: 0.044) and COVID-19 (difference in AUROC: 0.036, AUPRC: 0.019) cohorts.

As a demonstration of the potential utility of PICTURE, an individual hospital encounter was selected and the trajectories of PICTURE and the EDI are visualized in Figure 5. A previous study assessing the use of the EDI in COVID-19 patients found an EDI score of 64.8 or greater to be an actionable threshold for identifying patients at increased risk [11]. As PICTURE scores lie on a different scale than the EDI, we determined two comparable thresholds by matching the EDI's sensitivity and PPV at the 64.8 threshold on the COVID-19 cohort: a sensitivity-aligned PICTURE threshold of 0.18 and a PPV-aligned threshold of 0.075. Due to the high event rate in this cohort, alert thresholds in non-COVID-19 patients may be lower.
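One way such aligned thresholds could be derived (the paper does not specify the exact procedure), assuming numpy arrays y, edi_scores, and picture_scores of encounter-level labels and maximum scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# Operating point of the EDI at the published threshold of 64.8.
flagged = edi_scores >= 64.8
edi_sens = (flagged & (y == 1)).sum() / (y == 1).sum()
edi_ppv = y[flagged].mean()

# PICTURE threshold whose sensitivity matches the EDI's.
_, tpr, thresholds = roc_curve(y, picture_scores)
sens_aligned = thresholds[np.argmin(np.abs(tpr - edi_sens))]

# PICTURE threshold whose PPV matches the EDI's; precision has one more
# entry than the threshold array, so drop its final element to align them.
precision, _, pr_thresholds = precision_recall_curve(y, picture_scores)
ppv_aligned = pr_thresholds[np.argmin(np.abs(precision[:-1] - edi_ppv))]
```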
In the encounter shown in Figure 5, the PICTURE score remains low until approximately 12.5 hours before the adverse event (in this case, transfer to an ICU level of care), when it crosses the PPV-aligned threshold. Approximately 11 hours before the event, the PICTURE score peaks at a value of 0.235, exceeding the sensitivity-aligned threshold of 0.18. After this initial peak, the PICTURE score remains elevated, staying above the PPV-aligned threshold of 0.075 until the patient is transferred. In contrast, the EDI score never exceeded its alert threshold, and it dropped when the PICTURE score increased.

To simulate what a clinician receiving an alert from PICTURE might encounter, the Shapley values explaining the PICTURE predictions at both alert thresholds are recorded in Table 7 below. Note that these explanations are dominated by respiratory features, though heart rate and temperature are also present. While these features may seem obvious predictors of the need for ICU care, it is worth highlighting that the EDI did not identify this patient as being at risk.

Figure 5: Panel A depicts the PICTURE predictions over the 27 hours before the patient is eventually transferred to an ICU level of care (green bar). Two possible alert thresholds are noted: one (red) based on the EDI's sensitivity at a threshold of 64.8 (as suggested by [11]), the other (yellow) based on the EDI's PPV at this threshold. Note that PICTURE peaks above the sensitivity-based threshold approximately 11 hours in advance of the ICU transfer, and then remains elevated above the PPV threshold until the transfer occurs. * and † mark the first time points at which PICTURE crossed each threshold, referenced in Table 7 below. Panel B demonstrates the EDI over the same time range, with the threshold of 64.8 suggested by [11] in red.

Table 7: Sample PICTURE explanations. The top five features corresponding to the PICTURE predictions as the score crosses the PPV-aligned threshold (*) and the sensitivity-aligned threshold (†), as noted in Figure 5. These predictions represent two possible points at which a clinician could receive an alert that their patient is deteriorating. For each prediction, the top five features as measured by Shapley values are reported. Such information could be shared alongside the prediction score to provide better clinical utility to healthcare providers. The number of standard deviations from the mean is included for comparison, calculated using the COVID-19 dataset. Note that oxygenation (supplemental oxygen, SpO2, and respiratory rate) and temperature play a dominant role in both cases. Heart rate represented the primary difference between these two time points: when the PICTURE score first exceeded the PPV threshold 12.5 hours before the ICU transfer, heart rate remained at 78 bpm and was not among the top features as measured by Shapley values; at 11 hours before the event, when the PICTURE score was at its highest, heart rate had jumped to 124 bpm and was the fourth-most influential feature.

Discussion

The PICTURE early warning system accurately predicts adverse patient outcomes, including ICU transfer, mechanical ventilation, and death, at Michigan Medicine. The ability to consistently anticipate these events may be especially valuable when considering a potential impending second wave of COVID-19 infections. The EDI is a widespread deterioration model that has recently been assessed in a COVID-19 population. Both PICTURE and the EDI were trained for general deterioration using approximately 130,000 encounters and thus are not overfit to the COVID-19 population [11], [12]. Using a head-to-head comparison, we demonstrated that PICTURE has higher performance than the EDI at a statistically significant level (α = 5%) for both COVID-19 positive and non-COVID-19 patients. In addition, PICTURE was capable of accurately predicting adverse events as far as 24 hours before the event occurred. Lastly, PICTURE can explain individual predictions to clinicians by displaying the variables that most influenced its prediction using Shapley values. This analysis is limited to a single academic medical center, and its generalizability to other healthcare systems will require future study.
Supplemental Table S.1 notes: Values are computed at the encounter level to avoid weighting the statistics by the number of observations per patient. For example, the mean is first calculated for each patient, and these individual means are then averaged across all patients. The standard deviation was computed by again taking the mean value for each patient and then calculating the standard deviation across all patients. The missingness rate represents the fraction of patients for whom no data were available for the duration of their encounter. Gender and race are recorded as 1 if the patient meets the criterion (e.g., is female) and 0 otherwise. IV fluid bolus is a 1/0 flag indicating whether the patient received a fluid bolus during their stay. Oxygen supplementation represents the maximum amount of oxygen received by the patient in the last 24 hours, and zero if no oxygen order was placed. Pulse pressure is defined as systolic minus diastolic blood pressure. Shock index is calculated as heart rate divided by systolic blood pressure, and is multiplied by age for the age-adjusted variant. SpO2 represents the minimum SpO2 in the last 24 hours. Abbreviations: BUN (blood urea nitrogen), GCS (Glasgow Coma Score), INR (international normalized ratio), MAP (mean arterial pressure), MCH (mean corpuscular hemoglobin), MCHC (mean corpuscular hemoglobin concentration), MCV (mean corpuscular volume), MPV (mean platelet volume), PT (prothrombin time), PTT (partial thromboplastin time), RBC (red blood cell count), RDW (red cell distribution width), SpO2 (peripheral oxygen saturation), WBC (white blood cell count).

References

[1] Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
[2] Early warning system scores for clinical deterioration in hospitalized patients: a systematic review
[3] Scoping review: The use of early warning systems for the identification of in-hospital patients at risk of deterioration
[4] Outreach and Early Warning Systems (EWS) for the prevention of intensive care admission and death of critically ill adult patients on general hospital wards
[5] Adverse Events in Hospitals: National Incidence Among Medicare Beneficiaries, 10-Year Update
[6] Statistical Modeling and Aggregate-Weighted Scoring Systems in Prediction of Mortality and ICU Transfer: A Systematic Review
[7] Unplanned transfers to a medical intensive care unit: causes and relationship to preventable errors in care
[8] Inpatient transfers to the intensive care unit: delays are associated with increased mortality and morbidity
[9] The Emperor Has No Clothes
[10] Artificial Intelligence from Epic Triggers Fast, Lifesaving Care for COVID-19 Patients
[11] Evaluating a Widely Implemented Proprietary Deterioration Index Model Among Hospitalized COVID-19 Patients
[12] AI Can Help Hospitals Triage COVID-19 Patients - IEEE Spectrum
[13] Demonstrating the Consequences of Learning Missingness Patterns in Early Warning Systems for Preventative Health Care: A Novel Simulation and Solution
[14] From local explanations to global understanding with explainable AI for trees
[15] Multicenter development and validation of a risk stratification tool for ward patients
[16] XGBoost: A Scalable Tree Boosting System
[17] The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death
[18] National Early Warning Score (NEWS): Standardising the assessment of acute illness severity in the NHS
[19] A comparison of the ability of the National Early Warning Score and the National Early Warning Score 2 to identify patients at risk of in-hospital mortality: A multi-centre database study
[20] On the interpretability of machine learning-based model for predicting hypertension
[21] A Unified Approach to Interpreting Model Predictions