key: cord-1053375-jmfbc640
authors: Sottile, Peter D; Albers, David; DeWitt, Peter E; Russell, Seth; Stroh, J N; Kao, David P; Adrian, Bonnie; Levine, Matthew E; Mooney, Ryan; Larchick, Lenny; Kutner, Jean S; Wynia, Matthew K; Glasheen, Jeffrey J; Bennett, Tellen D
title: Real-Time Electronic Health Record Mortality Prediction During the COVID-19 Pandemic: A Prospective Cohort Study
date: 2021-05-10
journal: J Am Med Inform Assoc
DOI: 10.1093/jamia/ocab100
sha: 546fbd5de37c61eabeda95677b7c02f0773a2c02
doc_id: 1053375
cord_uid: jmfbc640

OBJECTIVE: To rapidly develop, validate, and implement a novel real-time mortality score for the COVID-19 pandemic that improves upon SOFA for decision support for a Crisis Standards of Care team.

MATERIALS AND METHODS: We developed, verified, and deployed a stacked generalization model to predict mortality using data available in the EHR, combining five previously validated scores and additional novel variables reported to be associated with COVID-19-specific mortality. We validated the model with prospectively collected data from 12 hospitals in Colorado between March 2020 and July 2020. We compared the area under the receiver operating characteristic curve (AUROC) of the new model to those of the SOFA score and the Charlson Comorbidity Index.

RESULTS: The prospective cohort included 27,296 encounters, of which 1,358 (5.0%) were positive for SARS-CoV-2, 4,494 (16.5%) required intensive care unit care, 1,480 (5.4%) required mechanical ventilation, and 717 (2.6%) ended in death. The Charlson Comorbidity Index and SOFA scores predicted mortality with AUROCs of 0.72 and 0.90, respectively. Our novel score predicted mortality with an AUROC of 0.94. In the subset of patients with COVID-19, the stacked model predicted mortality with an AUROC of 0.90, whereas SOFA had an AUROC of 0.85.

DISCUSSION: Stacked regression provides a flexible, updatable, live-implementable, ethically defensible predictive analytics tool for decision support that begins with validated models and incorporates only novel information that improves prediction.

CONCLUSION: We developed and validated an accurate in-hospital mortality prediction score in a live EHR, automatically and continuously calculated, using a novel model that improved upon SOFA.

OBJECTIVE: To create a live predictive analytics scoring system to support Crisis Standards of Care (triage) decisions. The system should have the following characteristics: ethically defensible; continuously adaptable and updateable with new data and model information; temporally dependent; as personalized as possible; formed from both well-established, validated scoring models and novel models based on potentially preliminary data and information sources; quickly computable, so that refreshed scores can be generated on the order of minutes; and computable with data available in a real-time electronic health record system.

The SARS-CoV-2 virus has infected >70 million and killed >1.5 million people in the year since its emergence (December 2019). [1] The resulting pandemic has overwhelmed some regions' health care systems and critical care resources, forcing the medical community to confront the possibility of rationing resources. [2, 3] In the United States, critical care triage guidance in the setting of resource scarcity is produced at the state level through Crisis Standards of Care (CSC) protocols. [4, 5] These protocols attempt the difficult task of ethically allocating scarce resources to the individuals most likely to benefit, with the aim of saving the most lives.
[6] [7] [8] To accomplish this, CSC protocols use organ dysfunction scores and chronic comorbidity scores to assess patient survivability. Ideally, scoring would avoid systematic bias and be generalizable, accurate, flexible to circumstance, and computable within electronic health record (EHR) systems with data collected in real time. [9]

At the foundation of most CSC protocols is the Sequential Organ Failure Assessment (SOFA) score. [10, 11] SOFA and other acuity scores, e.g., SAPS II and APACHE, are well validated but have significant limitations. They were developed more than 20 years ago, before widespread adoption of EHRs, are rigid regarding context, and were designed to measure severity of illness and predict mortality based on a few data points. [12] [13] [14] [15] [16] [17] Although SOFA predicts mortality from influenza pneumonia poorly, it was operationalized for use in patients with COVID-19. [18, 19] Optimizing the accuracy of mortality predictions is critical for medical triage because the decision to withhold or withdraw life-sustaining therapies is heavily influenced by a single score in many states' CSC protocols. [11]

The COVID-19 pandemic created an emergent need for a novel, accurate, and location- and context-sensitive EHR-computable tool to predict mortality in hospitalized patients with and without COVID-19. Because developing a new score can take years, a predictive model must rely on well-validated scores. At the same time, COVID-19 is a novel disease for which existing scores may be of limited or unknown predictive value. As such, a predictive framework is required that relies on multiple previously validated scores and can incorporate new information but keeps only the new inputs that explicitly improve performance. Stacked generalization provides a solution. [20] A stacked model is built upon one or more baseline models (e.g., SOFA) and incorporates additional models only when they improve prediction. [21]

We rapidly developed, validated, and deployed a novel mortality score for triage of all hospitalized patients during the COVID-19 pandemic by stacking SOFA, qSOFA, a widely used pneumonia mortality score, an acute respiratory distress syndrome (ARDS) mortality model, and a comorbidity score. [22] [23] [24] [25] [26] We then integrated recently reported predictors that may reflect COVID-19 pathophysiology. To test the novel model, we conducted a prospective cohort study of acutely ill adults with and without COVID-19 disease. Because model development and training began before we had accumulated a large number of COVID-19 patients, we started by developing the novel mortality score using a multi-hospital retrospective cohort of 82,087 patient encounters (Figure 1B). As we accumulated COVID-19 patients, we conducted a prospective cohort study to validate the novel mortality score in patients with and without COVID-19.

Our work was anchored by four goals. First, to use SOFA as a baseline and address its limitations through stacked generalization, adding other models with the potential to improve robustness and predictive performance. Second, to integrate and test potential COVID-19-specific predictors. Third, to rapidly deploy the new model in a live EHR across a 12-hospital system that serves more than 1.9 million patients. Fourth, to validate model performance prospectively.
The Colorado Multiple Institutional Review Board approved this study. We originally developed, validated, and deployed the model using estimates from retrospective data, while simultaneously building technical capacity to transition to a model estimated on prospective data. The time from conception (March 2020) to deployment of the new model across the health system (April 2020) was one month. The model now generates a mortality risk estimate every 15 minutes for every inpatient across the health system. We then prospectively observed model performance through the end of July 2020. This study design is consistent with recent learning health system studies. [27] Because of the rapidly evolving pandemic, we built a data pipeline for the stacked mortality model to update as new data were captured from the EHR.

Rapid development and implementation of a new score in a real-time EHR requires a full clinical and informatics pipeline, including skilled data warehousing, data wrangling, machine learning, health system information technology (IT), and clinical and ethics personnel working in sync. [28] [29] [30] All data flowed to the study team from UCHealth's Epic instance through Health Data Compass (HDC), the enterprise data warehouse for the University of Colorado Anschutz Medical Campus (Figure 1A). [31] HDC is a multi-institutional data warehouse that links inpatient and outpatient electronic medical data, state-level all-payer claims data, and the Colorado Death Registry.

The retrospective cohort included all encounters of patients >14 years old hospitalized at any of UCHealth's 12 acute care hospitals between August 2011 and March 4, 2020, whose hospital stay included admission to either an intensive care unit (ICU) or intermediate care unit. We restricted the retrospective data to encounters completed before March 5, 2020, the date of the first reported COVID-19 case in Colorado. We excluded encounters with a do not attempt resuscitation (DNR) order placed within 12 hours of admission or a duration exceeding 14 days, as mortality after prolonged hospitalization likely represents different physiology than mortality from an acute event.

The prospective cohort included all encounters of patients >14 years old hospitalized at any of UCHealth's 12 hospitals between March 15, 2020 (the date UCHealth halted elective procedures) and the end of July 2020. Because CSC protocols apply to all hospitalized patients during a crisis, we included all inpatients regardless of level of care or COVID-19 status. We excluded encounters with a DNR order placed within 12 hours of admission, patients who were still admitted, and encounters longer than 30 days. The prospective cohort included a total of 28,538 encounters between March 15, 2020 and July 2020 (Figure 1C). Of these, 1,148 (4.0%) were excluded because the patient remained in the hospital at the time of data censoring: in-hospital survival could not be assessed. Additionally, we excluded 70 and 24 encounters, respectively, due to an active DNR order and encounter length >30 days. Of the remaining 27,296 encounters, 1,358 (5.0%) were positive for SARS-CoV-2, 4,494 (16.5%) included intensive care unit (ICU)-level care, 1,480 (5.4%) included invasive mechanical ventilation, and 717 (2.6%) died during the hospitalization.
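The exclusion logic above maps directly onto a data-frame filter. The sketch below is a minimal illustration of the prospective-cohort criteria, assuming a pandas DataFrame with one row per encounter and hypothetical column names (age, admit_time, discharge_time, dnr_order_time); it is not the study's actual pipeline code.

```python
import pandas as pd

def build_prospective_cohort(encounters: pd.DataFrame) -> pd.DataFrame:
    """Apply the prospective-cohort exclusions described in the text.

    Assumes datetime64 columns admit_time/discharge_time/dnr_order_time
    (dnr_order_time is NaT when no DNR order exists) and a numeric age column.
    """
    df = encounters.copy()
    df = df[df["age"] > 14]                                # patients >14 years old
    df = df[df["admit_time"] >= pd.Timestamp("2020-03-15")]  # start of prospective window
    df = df[df["discharge_time"].notna()]                  # drop patients still admitted (censored)

    # Exclude encounters longer than 30 days
    los_days = (df["discharge_time"] - df["admit_time"]).dt.total_seconds() / 86400
    df = df[los_days <= 30]

    # Exclude encounters with a DNR order placed within 12 hours of admission;
    # NaT (no DNR order) compares False, so those encounters are kept
    hours_to_dnr = (df["dnr_order_time"] - df["admit_time"]).dt.total_seconds() / 3600
    df = df[~(hours_to_dnr <= 12)]
    return df
```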
Of the 1,480 patients who received mechanical ventilation, 408 (27.6%) died. Additional demographics are shown in Table 1, eTable 1, and eTable 2.

We developed a model using stacked generalization to predict mortality. [32] [33] [34] A stacked regression model takes other component models as covariates and estimates weights in accordance with their predictive power. [34] We chose ridge regularized logistic regression as the top-level model to limit overfitting and to address correlation between the component models. Stacking allows for robust, accurate, and interpretable evaluation of the underlying models. In our case, because the second model level was a regularized logistic regression, we could observe the contribution of the first-level models explicitly. Importantly, the stacked model never performs worse than the most accurate component model.

Our stacked regression construction takes six logistic regression mortality models as covariates (Figure 2). Four are validated organ dysfunction or pneumonia/ARDS mortality prediction tools, a fifth is a comorbidity score, and a sixth is novel and COVID-specific. These models include: (1) SOFA [12], (2) qSOFA [22], (3) the CURB-65 adult pneumonia mortality score [25], (4) a modified version of an ARDS mortality model [24], (5) the Charlson Comorbidity Index, and (6) a model made up of laboratory measures associated with COVID-19 disease severity or mortality (eMethods, eTable 3). [26, 35] This model includes, for example, D-dimer, lactate dehydrogenase (LDH), absolute lymphocyte count (ALC), and creatine kinase (CK) (eMethods, eTable 3). [36] [37] [38] The ARDS mortality model was pared down to the subset of predictors reliably available in structured form in live EHRs. We fit multiple forms of qSOFA, SOFA, and CURB-65 in an attempt to find the best balance of parsimony and knowledge gained (Figure 2). Variables such as gender, race, and disability status were not included in any models, per bioethics recommendations, to avoid potential bias. Only the summary score from the Charlson Index was included; no individual comorbidities were input into the models, in order to avoid socioeconomic bias associated with some diagnoses, e.g., diabetes.

Probability of mortality varies over the hospital course (Appendix B) and can be estimated at any time during the hospitalization. In order to estimate and validate the model, we selected a single reference time point against which to make a prediction: when the SOFA score reached its maximum for the encounter. The retrospective data used to estimate the model included only patients with a definitive outcome, either discharge or death. In order to train the models on retrospective data, we needed an effective "normalizing" point, a single point in time at which to predict eventual mortality, acknowledging that patients are nonstationary: they enter the hospital in one state and change continuously until they leave in that state or another. If we estimated the models from retrospective data using every time point of every patient, we would impose a severe selection bias (patients with long stays would influence the model more heavily than those with short stays). Instead, we needed to select a single reference time point per patient with which to estimate the models.
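As a concrete illustration of this two-level construction, the sketch below fits a ridge-penalized logistic regression over the component models' predicted probabilities. It assumes the six component models have already been fit and scored; the function name and the use of scikit-learn's LogisticRegressionCV are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def fit_stacked_model(component_probs: np.ndarray, died: np.ndarray):
    """Fit the top-level (stacking) model.

    component_probs: (n_encounters, 6) matrix of mortality probabilities from
    the component models (SOFA, qSOFA, CURB-65, modified ARDS model, Charlson
    Comorbidity Index, COVID-19 laboratory model), each fit on a separate split.
    died: binary in-hospital mortality outcome.
    """
    # L2 (ridge) penalty limits overfitting and handles correlation between
    # component models; cross-validation selects the regularization strength,
    # analogous to the paper's 3-fold CV over the ridge parameter.
    stacker = LogisticRegressionCV(
        penalty="l2", Cs=10, cv=3, scoring="roc_auc", max_iter=1000
    )
    stacker.fit(component_probs, died)
    # Because the top level is a regularized logistic regression, the fitted
    # coefficients expose each component model's contribution explicitly.
    return stacker, stacker.coef_.ravel()
```

A design consequence worth noting: if one component carries no additional signal, the ridge penalty shrinks its weight toward zero, which is how the stack keeps only inputs that improve prediction.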
To be conservative, and to avoid assuming knowledge about the future health trajectory of the patients the models were being applied to, we assumed that, in production, the current score (at each time step) of each hospitalized patient was the worst SOFA score (most organ dysfunction) they would experience. To estimate models suited to that situation, we computed SOFA for every patient in the retrospective training dataset along their entire stay, found the time point at which SOFA reached its peak for each patient, and trained the models using the covariates from that reference time point. Operationally, this framework allows for real-time mortality prediction under the conservative assumption that the current measured state of the patient is the worst state the patient will experience. While this assumption will not be correct for all moments in time, it effectively underestimates the patient's overall mortality, reducing the chance of premature limitation of critical care resources if used for triage decisions.

We divided the retrospective data 40%-40%-20% for estimating the baseline logistic regression models, estimating the stacked model, and evaluating the stacked model, respectively (Figure 2). We estimated the stacked models with regularized (ridge) logistic regression and used 3-fold cross-validation to select a regularization parameter. The final stacked model was evaluated using empirical-bootstrap-estimated confidence intervals (CIs) and a primary metric of area under the receiver operating characteristic curve (AUROC). We validated the stacked model using the prospective cohort and the AUROC. We chose AUROC as the accuracy metric because the primary goal of the mortality score was to generate a rank-ordered list of patients with associated survival probabilities to inform the allocation of scarce resources. The AUROC is an estimate of the probability of correctly ranking a case compared to a non-case. We also estimated other accuracy metrics, including positive predictive value (PPV), sensitivity, specificity, accuracy, and F1 measure (eTable 7), as well as area under the precision-recall curve (AUPRC, Table 3 and eTable 5). We evaluated calibration using the Brier score and Cox calibration regression.

To evaluate the impact of COVID-19 on mortality prediction, as sensitivity analyses, we retrained the model using the same training strategy but limited the training data to the prospectively collected data (sensitivity analysis 1, Figure 2) and to prospectively collected data from patients with COVID-19 (sensitivity analysis 2, Figure 2). We divided the cohort of patients with COVID-19 40%-40%-20% for estimating the baseline logistic regression models, estimating the stacked model, and evaluating the stacked model, respectively.

The data underlying this article were provided by UCHealth by permission and cannot be shared. Analytic code will be made available on GitHub upon request to the corresponding author.

This novel score was developed with the purpose of optimizing mortality prediction for decision support for crisis triage.
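Two mechanics described above, selecting each encounter's maximum-SOFA time point as the training reference and bootstrapping a confidence interval around the AUROC, can be sketched as follows. Column names are hypothetical, and a simple percentile bootstrap stands in for the paper's empirical bootstrap.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def reference_rows(timeseries: pd.DataFrame) -> pd.DataFrame:
    """Pick one reference row per encounter: the first time SOFA hits its peak.

    timeseries: one row per (encounter_id, time) with a computed 'sofa' column.
    """
    idx = timeseries.groupby("encounter_id")["sofa"].idxmax()
    return timeseries.loc[idx]

def bootstrap_auroc(y_true: np.ndarray, y_prob: np.ndarray,
                    n_boot: int = 2000, seed: int = 0):
    """Point estimate plus percentile-bootstrap 95% CI for the AUROC."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aurocs = []
    for _ in range(n_boot):
        i = rng.integers(0, n, n)           # resample encounters with replacement
        if y_true[i].min() == y_true[i].max():
            continue                        # skip degenerate resamples with one class
        aurocs.append(roc_auc_score(y_true[i], y_prob[i]))
    lo, hi = np.percentile(aurocs, [2.5, 97.5])
    return roc_auc_score(y_true, y_prob), (lo, hi)
```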
Consequently, the score parameters needed to fall within the ethical framework developed for crisis triage. In catastrophic circumstances, the goal of a resource allocation process should be to provide the most benefit to as many people as possible, and to do so in ways that sustain social cohesion and trust in the healthcare system. To maintain trust, recommendations for rationing of resources must be made prospectively, transparently, and consistently across the institution and region, and by decision-makers independent of the care team. For this reason, the target users at our health system were members of a triage team that would be activated if CSC became necessary. Moreover, any decision to ration resources must embrace a commitment to fairness and a proscription against rationing based on non-clinical factors such as race, gender, sexual orientation, disability, religious beliefs, citizenship status, or "VIP," socioeconomic, or insurance status. [39] [40] [41] [42] Consequently, factors such as race and potential proxies of race were excluded from score development, even if they had the potential to improve accuracy. Ethical considerations for score development are more fully described in Appendix C.

When validated using the prospective cohort, the individual component models predicted point-wise mortality (estimates of mortality risk ranging from 1-99%) with AUROCs ranging from 0.72 (Charlson Comorbidity Index) to 0.90 (SOFA) (Table 3). The stacked model predicted point-wise mortality better than any individual model: AUROC 0.94 (Figure 3). Most prospective encounters (95.7%) had predicted point-wise mortalities of less than 10%. Within this group, observed mortality was only 1.0%, suggesting that the stacked model accurately identifies patients with low mortality (eTable 4).

Table 3 shows the AUROC and AUPRC for each of the component models and the final stacked model. Models were trained and validated on the initial retrospective cohort. The models were then validated on the prospective cohort and on the subset of patients with COVID-19. The AUROC and AUPRC for the retrospective cohort were based on a 20% holdout of the encounters for testing and evaluation. The prospective validation cohort reflects expected performance when running in a live EHR for both COVID-19 positive and negative patients. Bootstrapped 95% confidence intervals are shown for both AUROC and AUPRC.

In patients with COVID-19, the AUROCs for SOFA, CURB-65, the Charlson Comorbidity Index, and the novel variables were 0.85, 0.90, 0.75, and 0.91, respectively. In this subset of patients, the stacked model predicted mortality with an AUROC of 0.90. Additional performance metrics, including precision and recall, are shown in Figure 4. The stacked model predicted mortality with the narrowest 95% confidence intervals at the extremes of predicted mortality (Figure 5). Even at moderate predicted mortalities, 95% confidence intervals were generally narrower than ten percentage points. On average, patients who died had estimates of mortality probability that were high at admission and remained high (Figure 6). Patients who survived tended to have, on average, a much lower probability of mortality and a relatively smooth trajectory. In sensitivity analyses, we generated models predicting mortality at 3 and 7 days after admission.

Our stacked model's ability to predict mortality is tailored to our patient population in Colorado and could easily be tailored to smaller populations.
This is important given the varied experiences with COVID-19. Our in-hospital (12% versus 21%) and ventilator (35% versus 88%) mortality rates were substantially lower than those of a New York cohort from the first wave of the pandemic. [48] Our mortality rates approach those expected for moderate-severe ARDS. [49, 50] There are potentially many explanations for these differences, including younger age, differences in comorbidities, differences in therapeutic interventions, and learning from the experience of earlier affected areas. Moreover, the utilization of ICU-level care and mechanical ventilation varies widely across the world: in New York, 14.2% of patients were treated in an ICU and 12.2% of patients received mechanical ventilation. In contrast, in a cohort of patients in China, 50.6% of patients were admitted to an ICU and 42.2% received mechanical ventilation. [37, 38, 43] Such differences may affect the predictive characteristics of a mortality score. Moreover, we found that patients with COVID-19 have unique characteristics and may benefit from specific mortality prediction models. Therefore, utilizing EHR data streams allows the flexibility to add additional components and retrain the stacked model as new knowledge and clinical experience accumulate. Importantly for generalization, the model can be tuned in real time to other local patient populations and disease characteristics.

Several aspects of the informatics infrastructure and workflow are important. First, such a rapid development process would have been impossible without a robust data warehouse staffed by experts with deep knowledge of EHR data and common clinical data models. The availability of high-quality data is known to be among the largest challenges in clinical applications of machine learning. [51] Second, our data science team was in place and had substantial shared experience with data from the health system. It would be extremely challenging to either rapidly hire or outsource the necessary expertise during a pandemic. Third, our data science team already had access to highly capable cloud-based and on-premises HIPAA-compliant computational environments. Establishing the processes and controls for such an environment takes time and expert human resources; our campus had already made those investments. Fourth, our multidisciplinary team included leadership, a variety of potential end-users, and experts from ethics, clinical informatics, machine learning, and clinical care. [28] This diversity critically grounded the project in ethical principles and pragmatic clinical realities and allowed us to quickly iterate to a practical, implementable, and interpretable model. Because of urgent operational needs, we also had full institutional and regulatory support. Finally, we evaluated the model prospectively, an important gold standard not often met by new machine learning-based informatics tools. [28]

Of note, there are many reports in the literature describing the development of predictive models using EHR data, but very few reports of the implementation of those models in a live EHR for clinical use.
In this case, the total elapsed time, including data extraction, model construction, and deployment, was one month.

More flexible machine learning methods might have yielded more accurate predictions. [52] [53] [54] However, we chose methods that were robustly estimable and would allow for transparent interpretation of underlying model contributions to the overall score. Fourth, in-hospital mortality may not be the optimal metric for making triage decisions. One-year mortality or other related outcome measures may be better metrics but, given the desire to validate a mortality predictor quickly, longer-term outcomes were not available. Fifth, our data and patient population are specific to Colorado, and results may differ geographically. Sixth, while a multidisciplinary group of experts designed this score to minimize potential bias from race, ethnicity, or socioeconomic status, this has not been rigorously validated and is the focus of ongoing research. Finally, some clinical indicators of illness severity were not included in the models, e.g., prone positioning, continuous renal replacement therapy, and radiographic results. These data may improve mortality prediction but are difficult to routinely and reliably auto-extract from the EHR.

We developed a novel and accurate in-hospital mortality score that was deployed in a live EHR and automatically and continuously calculated for real-time evaluation of patient mortality. The score can be tuned to a local population and updated to reflect emerging knowledge regarding COVID-19. Moreover, this score adheres to the ethical principles necessary for triage. [39] [40] [41] [42] Further research is needed to test multicenter score performance, refine mortality prediction over longer periods of time, and investigate the optimal methods for using such a score in a CSC protocol.

We would like to acknowledge Sarah Davis, Michelle Edelmann, and Michael Kahn at Health Data Compass.

Author Contributions: TDB had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. PDS, DA, PED, SR, JS, and TDB contributed substantially to the study design, data acquisition, data analysis and interpretation, and the writing of the manuscript. DPK, BA, RM, and LL contributed substantially to data acquisition, verifying data integrity, and the writing of the manuscript. MEK contributed substantially to the study design and the writing of the manuscript. DA, MEL, and PDS conceptualized and initially designed the statistical modeling framework. JSK, MKW, and JJG contributed substantially to the study design, data acquisition, and the writing of the manuscript.
Figure 2 Legend: The retrospective cohort was used for training and validation (in a 40%-40%-20% split). The prospective and COVID-19 positive cohorts were used to validate the retrospectively trained model.

Figure 4 Legend: The purpose of the main stacked model was to create a ranked patient list by probability of mortality. If the model were to be used as part of a clinical decision support alert, then a threshold for the estimated probability would need to be used to define when an alert fires. Figure 4 shows common model performance metrics as a function of that threshold.

Figure 5 Legend: This figure shows the width of the 95% confidence interval (y-axis) around the stacked model mortality probability estimates at each potential value of the estimated probability. Confidence intervals were narrowest at the extremes of mortality probability (likely the most actionable predictions, and thus the predictions with the highest stakes).

Figure 6 Legend: This figure shows the smoothed average probability of mortality over the course of the hospitalization, stratified by actual mortality. On average, patients who died had mortality probability estimates much higher than those who did not die, even shortly after admission.
References:
An interactive web-based dashboard to track COVID-19 in real time
IHME COVID-19 Forecasting Team. Modeling COVID-19 scenarios for the United States
Simple triage scoring system predicting death and the need for critical care resources for use during epidemics
The Simple Triage Scoring System (STSS) successfully predicts mortality and critical care resource utilization in H1N1 pandemic flu: A retrospective analysis
A Modified Sequential Organ Failure Assessment Score for Critical Care Triage
Ethical Triage Demands a Better Triage Survivability Score
Ventilator Triage Policies During the COVID-19 Pandemic at U.S. Hospitals Associated With Members of the Association of Bioethics Program Directors
Variation in Ventilator Allocation Guidelines by US State During the Coronavirus Disease 2019 Pandemic: A Systematic Review
The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine
Use of the SOFA score to assess the incidence of organ dysfunction/failure in intensive care units: results of a multicenter, prospective study. Working group on "sepsis-related problems" of the European Society of Intensive Care Medicine
Efficacy and accuracy of qSOFA and SOFA scores as prognostic tools for community-acquired and healthcare-associated pneumonia
A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study
The use of maximum SOFA score to quantify organ dysfunction/failure in intensive care. Results of a prospective, multicentre study
Serial Evaluation of the SOFA Score to Predict Outcome in Critically Ill Patients
An observational cohort study of triage for critical care provision during pandemic influenza
An assessment of the validity of SOFA score based triage in H1N1 critically ill patients during an influenza pandemic
Stacked generalization
The Elements of Statistical Learning
Assessment of Clinical Criteria for Sepsis
Performance of the quick Sequential (sepsis-related) Organ Failure Assessment score as a prognostic tool in infected patients outside the intensive care unit: a systematic review and meta-analysis
Predictors of hospital mortality in a population-based cohort of patients with acute lung injury
Short-term mortality of adult inpatients with community-acquired pneumonia: external validation of a modified CURB-65 score
Updating and Validating the Charlson Comorbidity Index and Score for Risk Adjustment in Hospital Discharge Abstracts Using Data From 6 Countries
Balanced Crystalloids versus Saline in Critically Ill Adults
Do no harm: a roadmap for responsible machine learning for health care
Leveraging Clinical Expertise as a Feature, not an Outcome, of Predictive Models: Evaluation of an Early Warning System Use Case
Relationship between nursing documentation and patients' mortality
Feature-Weighted Linear Stacking
Comparing Bayes Model Averaging and Stacking when Model Approximation Error cannot be Ignored
Stacked regressions
A new method of classifying prognostic comorbidity in longitudinal studies: Development and validation
Risk Factors Associated With Acute Respiratory Distress Syndrome and Death in Patients With Coronavirus Disease
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
Clinical Characteristics of Covid-19 in New York City
Principles for allocation of scarce medical interventions
The Toughest Triage: Allocating Ventilators in a Pandemic
Triage of Scarce Critical Care Resources in COVID-19: An Implementation Guide for Regional Allocation: An Expert Panel Report of the Task Force for Mass Critical Care and the American College of Chest Physicians
Who should receive life support during a public health emergency? Using ethical principles to improve allocation decisions
Acute Physiology and Chronic Health Evaluation II Score as a Predictor of Hospital Mortality in Patients of Coronavirus Disease
Validation of pneumonia prognostic scores in a statewide cohort of hospitalised patients with COVID-19
Assessment of risk scores in Covid-19
Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19
Discriminant Accuracy of the SOFA Score for Determining the Probable Mortality of Patients With COVID-19 Pneumonia Requiring Mechanical Ventilation
Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area
Epidemiology, Patterns of Care, and Mortality for Patients With Acute Respiratory Distress Syndrome in Intensive Care Units in 50 Countries
Respiratory Pathophysiology of Mechanically Ventilated Patients with COVID-19: A Cohort Study
Machine Learning in Medicine
Gaussian Processes in Machine Learning
Conditional expectation estimation through attributable components
Enabling personalized decision support with patient-generated data and attributable components

Funding: PS is supported by NIH K23 HL145001; DA and ML by NIH R01 LM012734; DK by NIH K08 HL125725; and TB by NIH UL1 TR002535 and NIH UL1 TR002535-03S2.

Abbreviations:
ALC - absolute lymphocyte count
APACHE II - Acute Physiology and Chronic Health Evaluation
ARDS - acute respiratory distress syndrome
AUROC - area under the receiver operating characteristic curve
CCI - Charlson Comorbidity Index
CI - confidence interval
CK - creatine kinase
CSC - Crisis Standards of Care
EHR - electronic health record
HDC - Health Data Compass
ICU - intensive care unit
IT - information technology
LDH - lactate dehydrogenase
SAPS II - Simplified Acute Physiology Score
SOFA - Sequential Organ Failure Assessment
PSI - Pneumonia Severity Index