key: cord-1036519-hxo8h908
authors: Singh, Karandeep; Valley, Thomas S.; Tang, Shengpu; Li, Benjamin Y.; Kamran, Fahad; Sjoding, Michael W.; Wiens, Jenna; Otles, Erkin; Donnelly, John P.; Wei, Melissa Y.; McBride, Jonathon P.; Cao, Jie; Penoza, Carleen; Ayanian, John Z.; Nallamothu, Brahmajee K.
title: Evaluating a Widely Implemented Proprietary Deterioration Index Model among Hospitalized Patients with COVID-19
date: 2021-03-30
journal: Annals of the American Thoracic Society
DOI: 10.1513/annalsats.202006-698oc
sha: 93b7849e2404ee8658e15c45dd0a644d7f53fd06
doc_id: 1036519
cord_uid: hxo8h908

Rationale: The Epic Deterioration Index (EDI) is a proprietary prediction model implemented in over 100 U.S. hospitals that was widely used to support medical decision-making during the coronavirus disease (COVID-19) pandemic. The EDI has not been independently evaluated, and other proprietary models have been shown to be biased against vulnerable populations. Objectives: To independently evaluate the EDI in hospitalized patients with COVID-19 overall and in disproportionately affected subgroups. Methods: We studied adult patients admitted with COVID-19 to units other than the intensive care unit at a large academic medical center from March 9 through May 20, 2020. We used the EDI, calculated at 15-minute intervals, to predict a composite outcome of intensive care unit–level care, mechanical ventilation, or in-hospital death. In a subset of patients hospitalized for at least 48 hours, we also evaluated the ability of the EDI to identify patients at low risk of experiencing this composite outcome during their remaining hospitalization. Results: Among 392 COVID-19 hospitalizations meeting inclusion criteria, 103 (26%) met the composite outcome. The median age of the cohort was 64 (interquartile range, 53–75) with 168 (43%) Black patients and 169 (43%) women. The area under the receiver-operating characteristic curve of the EDI was 0.79 (95% confidence interval, 0.74–0.84). EDI predictions did not differ by race or sex. When exploring clinically relevant thresholds of the EDI, we found patients who met or exceeded an EDI of 68.8 made up 14% of the study cohort and had a 74% probability of experiencing the composite outcome during their hospitalization with a sensitivity of 39% and a median lead time of 24 hours from when this threshold was first exceeded. 
Among the 286 patients hospitalized for at least 48 hours who had not experienced the composite outcome, 14 (13%) never exceeded an EDI of 37.9, with a negative predictive value of 90% and a sensitivity above this threshold of 91%. Conclusions: We found the EDI identifies small subsets of high-risk and low-risk patients with COVID-19 with good discrimination, although its clinical use as an early warning system is limited by low sensitivity. These findings highlight the importance of independent evaluation of proprietary models before widespread operational use among patients with COVID-19.

The coronavirus disease pandemic is straining the capacity of hospitals and healthcare systems across the United States (1, 2). Accurately identifying subgroups of patients with COVID-19 at high risk and low risk for adverse outcomes could help to alleviate this strain by better directing scarce resources to those patients at greatest need. This need has led to the development and use of clinical prediction models in patients with COVID-19. Many of these models suffer from a high risk of bias due to sample sizes too small to allow for both model development and validation (3). Although the majority of studies have focused on newly developed models, many models already exist to detect clinical deterioration among hospitalized patients.

One of the most widely used models is the Epic Deterioration Index (EDI), which is implemented in hundreds of U.S. hospitals (4). The EDI was developed using data from three healthcare organizations between 2012 and 2016, and it uses clinical data to calculate risk scores at regular 15-minute intervals throughout a patient's stay starting from the time of hospital admission.

Although not specific to patients with COVID-19, the EDI has been widely used during the pandemic to support decision-making in patients with COVID-19 (5-7).

The widespread use of the EDI raises implementation concerns because there are no peer-reviewed publications describing its validity in any patient population. These concerns are particularly salient given that health systems are using the EDI in conflicting ways and with substantially different thresholds (7). Even before the onset of COVID-19, publicly available information about the EDI was limited to anecdotal reports of its value in critically ill patients (8, 9). The proprietary nature of models such as the EDI makes independent validation difficult because of a lack of complete information on the model's functional form and parameters (10). However, independent evaluation is needed because hospital-based models often do not perform well in external validation studies and because the performance of models erodes over time as use patterns change (11). In addition, some widely adopted proprietary models have previously been shown to be biased against Black patients even when race was not included as a predictor (12, 13). Given that COVID-19 disproportionately impacts Black individuals with respect to its incidence and complications (14), the validity of the EDI needs to be established generally and for vulnerable subpopulations.

These concerns have not prevented its use from being advocated (5). An Epic Systems spokesperson recently stated that "some hospitals are now using the model with confidence" (6), whereas others suggest it is "helping save lives" (4). In this context, we sought to independently validate the ability of the EDI to predict adverse outcomes among diverse patients hospitalized with COVID-19 at a large academic medical center. We also stratified our evaluation by race, sex, and age to evaluate the model performance among key subgroups of patients. Our findings have potential implications for how the EDI, currently deployed in hundreds of U.S. hospitals (4), may be validated and used by healthcare systems during the COVID-19 pandemic, and more broadly for how proprietary models should be evaluated.

Our study cohort included adults 18 years and older diagnosed with COVID-19 who were admitted to Michigan Medicine (i.e., the academic health system of the University of Michigan in Ann Arbor) between March 9, 2020, and May 20, 2020, from the emergency department, outpatient clinics, and outside hospital transfers. We excluded encounters where patients were admitted directly to an intensive care unit (ICU) (n = 215) or were discharged to home or a separate facility on hospice (n = 34) or where EDI scores were not available (n = 10). Patients who transitioned to comfort care or end-of-life care in the hospital were not excluded. We also excluded patients who remained hospitalized but had not yet experienced the composite outcome described below (n = 27), because it was not possible to determine with certainty whether they would reach the primary outcome during their hospitalization. The study was approved by the Institutional Review Board of the University of Michigan Medical School, and the need for consent was waived.

The EDI is generated from a proprietary early-warning prediction model developed by Epic Systems Corporation using data that are routinely recorded within its electronic health record. Epic is one of the largest healthcare software vendors in the world, and its electronic health record is used by most of U.S. News and World Report's top-ranked healthcare systems and reportedly includes medical records for nearly 180 million Americans (or 56% of the U.S. population) (15).

The EDI aims to detect patients who deteriorate and require higher levels of care. Its score ranges from 0 to 100, with higher numbers denoting a greater risk of experiencing a composite adverse outcome of requiring rapid response, resuscitation, ICU-level care, or dying in the next 12-38 hours. Details of the specific cohorts in which the model was developed, the model parameters, and its detailed performance characteristics have not been shared publicly or described in the published literature.

All hospitalized patients at Michigan Medicine have had calculations of the EDI as part of an ongoing evaluation of its clinical utility since late 2018; however, the EDI was not used in any clinical protocols during this time period and thus clinicians were blinded to the score. Calculations of the EDI begin immediately after hospital admission and then continue at regular 15-minute intervals until discharge. Although the algorithm was developed before the COVID-19 pandemic, it includes several predictors that may be clinically relevant in patients with COVID-19, including age, vital sign measurements (systolic blood pressure, temperature, pulse, respiratory rate, oxygen saturation), nursing assessments (Glasgow Coma Scale, neurological assessment, cardiac rhythm, oxygen requirement), and laboratory values (hematocrit, white blood cell count, potassium, sodium, blood pH, platelet count, blood urea nitrogen).

We defined our primary outcome as a composite of adverse outcomes that included the first of any of the following events that occurred during the hospitalization: ICU-level care, mechanical ventilation, or in-hospital death. We chose to include these adverse events for the composite outcome because they are highly relevant in the clinical care of patients with COVID-19, in which rapid respiratory decline is frequently described.

We used scores from the EDI calculated every 15 minutes throughout the hospitalization to predict the composite adverse outcome during the hospitalization. For patients who experienced the outcome, we only used EDI scores calculated before the outcome. We evaluated the discriminative performance of the EDI using the area under the receiver-operating characteristic curve (AUC). The AUC represents the probability of correctly ranking two randomly chosen individuals (one who experienced the event and one who did not). Because the model runs every 15 minutes on all hospitalized patients, we calculated the AUC on the basis of the entire trajectory of predictions.

The AUC was calculated at the hospitalization level using the strategy defined by Henry and colleagues and Oh and colleagues (16, 17). The rationale for evaluating the AUC at the hospitalization level is that if a hypothetical alert were to be linked to a score threshold, whether the alert ever fired for any given patient would depend on whether this threshold was ever exceeded during the hospitalization. The EDI is recalculated every 15 minutes, and if a patient crossed a given alerting threshold even once, this would bring the patient to the clinician's attention if linked to an alert.
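To make the hospitalization-level calculation concrete, the following is a minimal Python sketch rather than the code used in this study. It assumes that each hospitalization is summarized by its maximum EDI before the outcome, which follows from the "ever exceeded" rationale above because an alert at any threshold fires exactly when the maximum reaches that threshold. The function names, the tuple layout, and the toy data are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Probability that a randomly chosen patient with the event has a higher
    score than a randomly chosen patient without it (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hospitalization_level_auc(scores, outcome_time):
    """scores: iterable of (hosp_id, hours_since_admission, edi).
    outcome_time: dict hosp_id -> hours to the composite outcome, or None."""
    max_score = {}
    for hosp_id, t, edi in scores:
        t_event = outcome_time[hosp_id]
        if t_event is not None and t >= t_event:
            continue  # only scores calculated before the outcome are used
        max_score[hosp_id] = max(edi, max_score.get(hosp_id, float("-inf")))
    pos = [s for h, s in max_score.items() if outcome_time[h] is not None]
    neg = [s for h, s in max_score.items() if outcome_time[h] is None]
    return auc(pos, neg)


# Toy data (hypothetical): A and D experience the outcome, B and C do not.
toy_scores = [
    ("A", 0.0, 30.0), ("A", 0.25, 72.0), ("A", 5.0, 90.0),  # 90.0 is after the event
    ("B", 0.0, 40.0), ("B", 0.25, 45.0),
    ("C", 0.0, 20.0), ("C", 0.25, 80.0),
    ("D", 0.0, 50.0),
]
toy_outcomes = {"A": 4.0, "B": None, "C": None, "D": 10.0}
print(hospitalization_level_auc(toy_scores, toy_outcomes))  # 0.5
```

The pairwise loop is quadratic in cohort size, which is acceptable for a cohort of a few hundred hospitalizations; a rank-based formula would be preferable for larger data.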

Predictions of deterioration are most beneficial when an appropriate lead time is available for action by clinicians. We therefore calculated a median lead time for the primary outcome by comparing when patients were first deemed high risk during their hospitalization (based on the "high-risk" threshold selected by intensivists) to when they experienced the outcome. In all cases, we calculated empirical 95% confidence intervals (95% CIs) for the AUC using 1,000 bootstrap replicates of our study cohort. Although it is unknown whether the EDI can be interpreted as a probability, model calibration was assessed using a calibration curve by comparing deciles of all predicted EDI to the observed risk.
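The lead-time and bootstrap calculations described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the study's implementation: it assumes a percentile bootstrap that resamples hospitalizations with replacement, assumes each trajectory is sorted by time, and uses hypothetical function names and toy inputs.

```python
import random
import statistics


def auc(pos, neg):
    """Pairwise AUC; ties between a positive and a negative count as 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, seed=0):
    """Percentile 95% CI from n_boot resamples of (score, label) pairs."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        pos = [scores[i] for i in idx if labels[i] == 1]
        neg = [scores[i] for i in idx if labels[i] == 0]
        if not pos or not neg:
            continue  # AUC is undefined if a resample lacks one class
        aucs.append(auc(pos, neg))
    aucs.sort()
    return aucs[(n_boot * 25) // 1000], aucs[(n_boot * 975) // 1000 - 1]


def first_crossing(trajectory, threshold):
    """Time of the first score at or above the threshold, else None."""
    for t, edi in trajectory:
        if edi >= threshold:
            return t
    return None


def median_lead_time(trajectories, outcome_time, threshold):
    """Median hours from first threshold crossing to the outcome, among
    patients who crossed the threshold before experiencing the outcome."""
    leads = []
    for hosp_id, traj in trajectories.items():
        t_event = outcome_time.get(hosp_id)
        if t_event is None:
            continue
        before = [(t, e) for t, e in traj if t < t_event]
        t_cross = first_crossing(before, threshold)
        if t_cross is not None:
            leads.append(t_event - t_cross)
    return statistics.median(leads) if leads else None
```

For example, with a hypothetical patient who first reaches 68.8 at hour 2 and experiences the outcome at hour 26, the lead time is 24 hours.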

Recalculation of the AUC at the prediction level with varying prediction horizons. To enhance the comparability of our evaluation to related work in this domain, we recalculated the AUC using a 4-hour, 8-hour, 12-hour, and 24-hour prediction time horizon for the outcome. In this analysis, we considered patients to only have met the outcome from the time of a prediction if the outcome occurred within the prediction horizon, and we calculated the AUC at the prediction level.
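The horizon-based labeling can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the study's code: each individual prediction is treated as one observation, predictions made at or after the outcome are dropped, and all names and toy values are hypothetical.

```python
def label_predictions(trajectory, outcome_time, horizon_hours):
    """Return (edi, label) pairs; label is 1 only if the outcome occurs
    within horizon_hours after the time of the prediction."""
    labeled = []
    for t, edi in trajectory:
        if outcome_time is not None and t >= outcome_time:
            continue  # predictions at or after the outcome are not evaluated
        positive = outcome_time is not None and (outcome_time - t) <= horizon_hours
        labeled.append((edi, int(positive)))
    return labeled


def prediction_level_auc(trajectories, outcome_time, horizon_hours):
    """Pool every labeled prediction across patients and compute the AUC."""
    pos, neg = [], []
    for hosp_id, traj in trajectories.items():
        for edi, label in label_predictions(traj, outcome_time[hosp_id], horizon_hours):
            (pos if label else neg).append(edi)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): P1 has the outcome at hour 14, P2 never does.
toy = {
    "P1": [(0, 30.0), (4, 40.0), (8, 70.0), (12, 80.0)],
    "P2": [(0, 20.0), (4, 35.0)],
}
toy_outcome = {"P1": 14, "P2": None}
print(prediction_level_auc(toy, toy_outcome, horizon_hours=8))   # 1.0
print(prediction_level_auc(toy, toy_outcome, horizon_hours=24))  # 0.875
```

In this toy example, lengthening the horizon relabels early, low-scoring predictions from P1 as positive, which changes the AUC even though the scores themselves are unchanged.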

To evaluate how the EDI performs in vulnerable populations, we conducted two analyses. First, we compared the mean EDI in demographic subgroups and in those with and without the following comorbidities: cardiac arrhythmias, chronic kidney disease, chronic pulmonary disease, congestive heart failure, depression, diabetes mellitus, hypertension, liver disease, metastatic cancer, obesity, rheumatoid arthritis or other collagen vascular diseases, and solid tumors without metastases. This analysis was conducted to identify which comorbidities result in a higher EDI score, recognizing that comorbidities are not directly included in the EDI model. Then, we compared the AUC for subgroups defined by age (≥65 yr vs. <65 yr), sex, and race to determine if the EDI performs equally well in these subgroups.

Another potential use of the EDI is to identify patients at low risk who could be sent home or to a lower-acuity setting, thereby offloading hospitals. Our goal was to evaluate how well the EDI at the end of 48 hours could identify patients at a low risk of experiencing the outcome during the remainder of their hospitalization. We selected 48 hours following admission because decisions to triage patients to lower-acuity care within this timeframe may be valuable to hospital systems struggling in response to a surge of inpatient cases. For this analysis, we excluded patients who were discharged or experienced the composite outcome within the first 48 hours as triage decisions were not relevant for this group (n = 65). We did this to remove very low-risk patients who were discharged as well as very high-risk patients who experienced the primary outcome early. The AUC was calculated based on the maximum EDI in the first 48 hours, with 95% CI based on 1,000 bootstrap replicates.

We calculated sensitivities, specificities, positive predictive values, and negative predictive values across the entire spectrum of EDI thresholds. We identified two clinically actionable thresholds based on the threshold-performance plots in consultation with intensivists on our research team (T.S.V. and M.W.S.): one for identifying high-risk patients who will likely need ICU-level care (based on the EDI throughout the hospitalization) and one for identifying low-risk patients who may be appropriate for lower-acuity care (based on the 48-h analysis). Table 1 ). Overall, the EDI score had an AUC of 0.79 (95% CI, 0.74-0.84) as a continuous predictor of risk. The performance characteristics of the EDI score are reported in Figure 1. Patients who met or exceeded an EDI of 68.8 had a 74% probability of experiencing the primary outcome during their hospitalization (i.e., positive predictive value) with a sensitivity of 39%, and they made up 14% of the study cohort. At this threshold, one deteriorating patient would be identified for every 1.4 patients in whom an alert was generated, a quantity also known as the number needed to evaluate (20). The median lead time from when the threshold was first exceeded to when the outcome occurred was 24 hours (IQR, 1.4-83). The entire distribution of lead times is further described in Figure 2.
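The threshold-level quantities above follow from a 2x2 table of alert status against outcome. The Python sketch below is illustrative only: the function name and toy data are hypothetical, and it assumes the number needed to evaluate is the reciprocal of the positive predictive value, which is consistent with the reported figures (1/0.74 is approximately 1.4).

```python
def threshold_metrics(max_scores, labels, threshold):
    """Classify each hospitalization as flagged if its maximum score meets
    or exceeds the threshold, then summarize the resulting 2x2 table."""
    tp = fp = fn = tn = 0
    for score, label in zip(max_scores, labels):
        flagged = score >= threshold
        if flagged and label:
            tp += 1
        elif flagged and not label:
            fp += 1
        elif not flagged and label:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None  # undefined when a margin is empty

    ppv = ratio(tp, tp + fp)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": ratio(tn, tn + fn),
        "nne": (1.0 / ppv) if ppv else None,  # alerts per true deterioration
    }


# Toy data (hypothetical), evaluated at the high-risk threshold from the text.
toy_scores = [90, 70, 60, 40, 30, 20, 75, 10]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
print(threshold_metrics(toy_scores, toy_labels, 68.8))

# Number needed to evaluate implied by the reported PPV of 74%.
print(round(1 / 0.74, 1))  # 1.4
```

Sweeping the threshold over the full range of observed scores and recomputing these metrics yields the kind of threshold-performance plot from which the two thresholds were chosen.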

overpredicts the risk of experiencing the primary outcome if interpreted as a probability. We found the prediction-level AUCs to be similar to those in our primary analysis although the positive predictive values were lower due to the shorter prediction horizon (Appendix Table 3 in the online supplement, Appendix Figures 1 and 2). In the disparate impact analyses, we found that EDI predictions did not differ by sex or race (Appendix Table 1). We did find higher EDI scores for patients 65 and older and those with cardiac arrhythmias, chronic kidney disease, chronic pulmonary disease, congestive heart failure, diabetes, and hypertension. In an analysis of model performance by subgroup, the EDI performed similarly across the demographic subgroups although the model appeared to perform better in patients with liver disease as compared with those without it (P = 0.028) (Appendix Table 2).

In the subset of 286 patients who had not been discharged or experienced the primary outcome at 48 hours, 55 (19%) experienced the composite outcome at some point during the remainder of their hospitalization. In this setting, the EDI had a hospitalization-level AUC of 0.65 (95% CI, 0.57-0.73). The performance characteristics of the 48-hour maximum EDI in this subset of patients are reported in Figure 4. A total of 14 (13%) patients who never exceeded an EDI of 37.9 in the first 48 hours of their hospitalization had a 90% probability of not experiencing the outcome (i.e., negative predictive value) for the remainder of the hospitalization (median remaining follow-up of 3.8 d [IQR, 1.7-8.9]) with a sensitivity of 91% above this threshold. Figure 5 demonstrates four examples of EDI patterns in patients with COVID-19 overlaid with the identified high- and low-risk thresholds (≥68.8 for high risk and <37.9 for low risk, respectively). As shown in the bottom-left and top-right panels of this figure, the EDI fluctuates substantially for individual patients with each assessment over the regular 15-minute intervals.

Our study constitutes the first publicly reported independent validation of the EDI in any patient population. Our results suggest that the EDI exhibits good discrimination for the prediction of adverse outcomes in a diverse COVID-19 population. It demonstrated good performance in identifying higher-risk patients in our cohort, identifying a small proportion of patients with a positive predictive value of 74% but a relatively low sensitivity of 39%. There was no single threshold at which the EDI exhibited both a high positive predictive value and high sensitivity. Thus, from a standpoint of clinical actionability, the model's utility is somewhat limited. We found that the maximum EDI in the first 48 hours of the hospitalization may help identify a small subset of low-risk patients who may be safely transferred to lower-acuity settings (such as a field hospital), thereby conserving resources. However, 10% of the patients identified by the EDI as low risk may ultimately deteriorate, so the decision to deescalate care for patients should not be based solely on the EDI. For the vast majority of patients whose maximum EDI score falls in the intermediate-risk range, the score has limited value to guide clinical decision-making. We also noted the EDI fluctuates substantially when calculated at 15-minute intervals, in part because it only relies on the most recent value for each of the predictors. Even small changes in predictors lead to large differences in the EDI because prior normal values are ignored when a new value is obtained. Thus, we recommend that the interpretation of individual EDI scores be based on whether a patient ever exceeds specific thresholds. The substantial variation in EDI scores also underscores the notion of diminishing returns when running the model so frequently.

The proprietary nature of the EDI raises specific ethical and clinical concerns in the setting of a pandemic, in which resources may be scarce and could be allocated to higher-risk patients based on the output of this prediction model. We found no evidence that the EDI is biased against specific subgroups of vulnerable patients although the EDI was not always concordant with the observed differences in adverse outcomes (in Table 1 as compared with Appendix Table 2). Although chronic pulmonary disease was not associated with an adverse outcome, the maximum EDI score was higher for patients with chronic pulmonary disease as compared with those without it. On the other hand, although adverse outcomes were more likely in white patients and patients with depression, metastatic cancer, and rheumatoid arthritis or other collagen vascular diseases, the maximum EDI score was not different for these subgroups. Although we found no evidence for bias in the EDI score against Black patients, who are disproportionately impacted by COVID-19, the wide confidence intervals (Appendix Table 2) mean that such bias cannot be definitively excluded. The EDI score was higher in older individuals, which is not surprising because age is a component of the EDI. Because the EDI is a proprietary model, several additional steps should be taken to ensure that its use is valid and improves outcomes. First, making the model parameters available would allow for a more complete validation and could benefit the public by enabling the model to be refined and compared with other existing prediction models for specific clinical applications (21). Second, prospective validation of our proposed thresholds in other centers and clinical conditions would help validate the generalizability of the EDI. Third, linking targeted interventions to specific EDI thresholds or other clinical assessments would assess whether the EDI can improve patient outcomes.

When comparing our observed performance of the EDI in patients with COVID-19 against other models reported in the literature, it is interesting to note the AUC reported in our study is much lower than other models in the COVID-19 literature. This could be due in part to the overfitting of other models in the setting of relatively small sample sizes. For instance, Bai and colleagues report an AUC of 0.95 for inpatient clinical deterioration with a model developed and validated in a cohort of 133 patients with 75 predictors (22). By contrast, the EDI was developed on a cohort drawn from more than 130,000 hospitalizations (6). Our findings closely match the observed performance of other deterioration indices that have been validated in patients without COVID-19. In a study of 649,418 hospitalizations, the Advanced Alert Monitor identified deteriorating patients with an AUC of 0.82 (23). The electronic Cardiac Arrest Risk Triage score identified deteriorating patients with an AUC of 0.80 and 0.79 in two separate evaluations (23, 24). The Rothman index identified clinical deterioration with an AUC of 0.76 (25).

Our study should be interpreted in the context of the following limitations. Our evaluation was limited by its focus on a single academic medical center and a relatively small number of patients. However, our cohort of nearly 400 patients was diverse in sex and race and larger than many earlier reports. As compared with a recently described large cohort of 5,700 patients hospitalized with COVID-19 in New York, our study cohort had a higher proportion of Black patients (43% vs. 23%) and patients with chronic kidney disease (38% vs. 5%), congestive heart failure (21% vs. 7%), and hypertension (75% vs. 57%) and similar proportions of women (43% vs. 40%) and patients with diabetes (42% vs. 34%) and obesity (42% vs. 42%) (2). Our proposed EDI thresholds may be influenced by local factors, including patterns of COVID-19 testing, triage, and decision-making about hospital admissions and hospital-to-hospital transfers that contributed to our study cohort. These EDI thresholds should be validated in other settings to assess their generalizability. Additionally, clinically relevant EDI thresholds may differ between patients hospitalized with COVID-19 and other conditions; thus, a broader validation of the EDI among hospitalized patients is warranted.

Despite these limitations, our findings have important implications for hospitals with access to the EDI that may be under substantial capacity constraints and strain from managing patients with COVID-19. Our study supports, in part, the role of the EDI to identify a small subset of high-risk patients who may benefit from additional resources and higher-level care and another limited subset of low-risk patients who may be cared for safely in lower-acuity settings. It also suggests opportunities to tailor and improve risk prediction for this condition beyond the EDI as data accumulate on patients with COVID-19. Finally, it indicates the need for institutions to independently validate widely used proprietary models where the vendor is commonly the only source of model validation.

Author disclosures are available with the text of this article at www.atsjournals.org.

1. Fair allocation of scarce medical resources in the time of covid-19

2. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York city area

3. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal

4. Artificial intelligence from Epic triggers fast, lifesaving care for COVID-19 patients

5. Stanford launches an accelerated test of AI to help with covid-19 care

6. AI can help hospitals triage COVID-19 patients

7. Hospitals are using AI to predict the decline of Covid-19 patients-before knowing it works

8. Ochsner health adopts new AI technology to save lives in real-time

9. XGM 2019: Epic experts expand and share their knowledge in Verona. Verona, WI: Epic Systems

10. Predictive analytics in health care: how can we know it works?

11. Calibration drift in regression and machine learning models for acute kidney injury

12. Dissecting racial bias in an algorithm used to manage the health of populations

13. Face recognition vendor test part 3: demographic effects

14. Characteristics associated with racial/ethnic disparities in COVID-19 outcomes in an academic health care system

15. We've spent billions to fix our medical records, and they're still a mess: here's why

16. A targeted real-time early warning score (TREWScore) for septic shock

17. A generalizable, data-driven approach to predict daily risk of Clostridium difficile infection at two large academic health centers

18. pROC: display and analyze ROC curves. version 1.16.2. 2020 [updated

19. The runway package for R. github

20. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use

21. Machine learning in clinical journals: moving from inscrutable to informative

22. Predicting COVID-19 malignant progression with AI techniques

23. Development and validation of an electronic medical record-based alert score for detection of inpatient deterioration outside the ICU

24. Comparison of the Between the Flags calling criteria to the MEWS, NEWS and the electronic Cardiac Arrest Risk Triage (eCART) score for the identification of deteriorating ward patients

25. Rothman Index variability predicts clinical deterioration and rapid response activation
