key: cord-0827335-5xgttgbs authors: Jezmir, Julia L.; Bharadwaj, Maheetha; Chaitoff, Alexander; Diephuis, Bradford; Crowley, Conor P.; Kishore, Sandeep P.; Goralnick, Eric; Merriam, Louis T.; Milliken, Aimee; Rhee, Chanu; Sadovnikoff, Nicholas; Shah, Sejal B.; Gupta, Shruti; Leaf, David E.; Feldman, William B.; Kim, Edy Y. title: Performance of Crisis Standards of Care Guidelines in a Cohort of Critically Ill COVID-19 Patients in the United States date: 2021-07-28 journal: Cell Rep Med DOI: 10.1016/j.xcrm.2021.100376 sha: 8f66c52497bd8358116f4738cf591a228dacd794 doc_id: 827335 cord_uid: 5xgttgbs Many US states published crisis standards of care (CSC) guidelines for allocating scarce critical care resources during the COVID-19 pandemic. However, the performance of these guidelines in maximizing population benefit has not been well tested. In 2,272 adults with COVID-19 requiring ICU admission drawn from the STOP-COVID multicenter cohort, we tested three approaches to CSC algorithms: SOFA scores grouped into ranges, SOFA score ranges plus comorbidities, and a hypothetical approach using raw SOFA scores not grouped into ranges. We found that area under receiver operating characteristic (AUROC) curves for all three algorithms demonstrate only modest discrimination for 28-day mortality. Adding comorbidity scoring modestly improves algorithm performance over SOFA scores alone. The algorithm incorporating comorbidities has modestly worse predictive performance for Black compared to White patients. CSC algorithms should be empirically examined to refine approaches to the allocation of scarce resources during pandemics and to avoid potential exacerbation of racial inequities. During the Coronavirus Disease 2019 (COVID-19) pandemic, more than 30 U.S. states developed crisis standards of care (CSC) guidelines. [1] [2] [3] These guidelines are designed to help hospitals allocate resources, such as ventilators, if they become scarce. 
4 Unlike the "all-come, all-served" promise of hospital resources during non-crisis situations, CSC guidelines aim to maximize population-wide benefit. [5] [6] State guidelines generally describe ethical principles and outline triage algorithms for resource allocation. [1] [2] To maximize population-wide benefit, these algorithms aim to identify patients most likely to survive if offered scarce resources. Nearly 90% of states with CSC triage algorithms adapted Sequential Organ Failure Assessment (SOFA) or Modified SOFA (MSOFA) scores to predict short-term prognosis (i.e., survival to hospital discharge) in an effort to maximize the number of lives saved. [1] [2] [7] [8] [9] [10] States vary in their use of SOFA/MSOFA scores, for example, by grouping scores in different ranges to assign priority points, or by modifying scoring calculations. [1] [2] Approximately 70% of states also incorporate measures of comorbidities in priority scoring, which may affect both short- and long-term prognosis. [1] [2] Additionally, some states use a multicomponent model that incorporates factors such as estimated survival or duration of benefit or need in addition to SOFA scores. 1 While ethicists have debated triage approaches, CSC algorithms have had limited empirical testing in the COVID-19 pandemic. In prior studies of non-COVID-19 ICU cohorts, different CSC algorithms yielded different prioritization results. 11 In a recent study of 675 critically ill patients with COVID-19, raw SOFA scores alone calculated at the time of intubation had limited ability to predict mortality. 12 Given the poor discrimination by SOFA scores, we hypothesized that algorithms that incorporate comorbidities, in addition to SOFA scores, have superior discriminant ability compared to algorithms that use SOFA scores alone. 
Since comorbidities are associated with race and ethnicity, [13] [14] [15] [16] [17] [18] we also hypothesized that incorporating comorbidities could alter the performance of CSC algorithms by race and ethnicity. In a multicenter cohort study of critically ill patients with COVID-19 admitted to ICUs across the US, we evaluated two representative CSC guidelines: New York's algorithm, 19 which relies exclusively on SOFA scores grouped into ranges, and a modified version of Colorado's algorithm, 20 which relies on SOFA score groupings plus comorbidities (Table 1). We tested the performance of these representative CSC algorithms in discriminating 28-day in-hospital mortality and in simulated clinical scenarios in which the algorithm selected one patient among a group of two to five patients. We focused on these two state guidelines because they represent two ends of the spectrum in considering comorbidities, with New York excluding consideration of comorbidities altogether and Colorado incorporating a broad range of preexisting conditions. Most, if not all, state algorithms take one of these approaches, with many state algorithms using a narrower range of comorbidities than Colorado. [1] [2] [3] We also tested a hypothetical algorithm of raw SOFA scores (not grouped into ranges) to assess the impact of the SOFA score ranges used by most states. We analyzed 2,272 patients who were intubated on the first day of ICU admission in the Study of the Treatment and Outcomes in Critically Ill Patients with COVID-19 (STOP-COVID) (Figure S1), a multicenter cohort study of adult patients at 68 hospitals across the United States (Table S1). 21 The mean age (SD) was 61 (14) years.
CSC algorithms assign "priority points" to estimate the likelihood of survival. Patients with fewer priority points are estimated to have a greater chance of survival, so these patients with lower priority scores are offered scarce resources. 
Nearly all verifiable CSC algorithms utilize the SOFA score, 1,2,7-10 a metric that assesses the level of dysfunction of six organ systems at the time of calculation. 22 For our analysis, we adapted the SOFA score calculation 9 to accommodate the data available in the STOP-COVID database, modifying the cardiovascular and central nervous system components (Table S2). Most CSC algorithms do not assign priority points based on the raw SOFA score, but rather assign priority points to SOFA scores grouped into ranges. 1,2 For example, New York's algorithm assigns 1 point to patients with SOFA scores <7, 2 points for SOFA scores of 8-11, and 3 points for scores >11. When two patients receive the same number of priority points, New York employs a lottery to break the tie (Table 1). 19 In contrast to New York, Colorado's algorithm assigns points to both SOFA score ranges and comorbidities. 20 Patients receive 1 priority point for SOFA scores <6, 2 points for scores 6-9, 3 points for scores 10-12, and 4 points for scores >12. Additional priority points are based on the Charlson Comorbidity Index. 23 Since we adapted the Charlson Comorbidity Index to available data (Tables S3-), we refer to a "modified" Colorado model. When two patients receive the same number of priority points, Colorado first prioritizes children, healthcare workers, and first responders. If still tied, Colorado prioritizes younger patients, pregnant patients, and caretakers for the elderly. If still tied, Colorado uses a lottery. As a third approach, we used a hypothetical algorithm of raw SOFA scores that are not collapsed into ranges (Table 1). The primary outcome was 28-day in-hospital mortality, which we assessed for each patient subcohort defined by their CSC priority score at the time of ICU admission and intubation (Figure 1A-C). 
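The SOFA range cut-offs quoted above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the function names are hypothetical, the comorbidity points are passed in as a number because the text does not give the point mapping for the modified Charlson Comorbidity Index, and a SOFA score of exactly 7 (which the New York ranges as quoted do not cover) is grouped with the lowest tier purely as an assumption.

```python
def new_york_points(sofa: int) -> int:
    """Priority points from the New York ranges quoted in the text."""
    if sofa < 0:
        raise ValueError("SOFA score cannot be negative")
    # ASSUMPTION: the quoted ranges (<7, 8-11, >11) leave 7 unassigned;
    # this sketch groups 7 with the lowest (best) tier.
    if sofa <= 7:
        return 1
    if sofa <= 11:
        return 2
    return 3


def colorado_sofa_points(sofa: int) -> int:
    """Priority points from the Colorado SOFA ranges quoted in the text."""
    if sofa < 0:
        raise ValueError("SOFA score cannot be negative")
    if sofa < 6:
        return 1
    if sofa <= 9:
        return 2
    if sofa <= 12:
        return 3
    return 4


def modified_colorado_points(sofa: int, comorbidity_points: int) -> int:
    """SOFA range points plus comorbidity points.

    `comorbidity_points` is a hypothetical input: the mapping from the
    modified Charlson Comorbidity Index to points is not given in the text.
    """
    return colorado_sofa_points(sofa) + comorbidity_points


def raw_sofa_points(sofa: int) -> int:
    """Hypothetical raw SOFA algorithm: the ungrouped score is the priority."""
    return sofa
```

Under all three schemes a lower total means higher priority for the scarce resource, so a patient with a SOFA score of 5 receives 1 point from either state's SOFA component.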
All algorithms have an increased fraction of surviving patients in the "better" priority categories (i.e., lower priority point total) that would be prioritized for scarce resources. We next assessed the accuracy of each CSC algorithm in discriminating 28-day in-hospital mortality by the area under the receiver operating characteristic (AUROC) curve. AUROC was 0.61 (95% CI, 0.59-0.63) for New York (i.e., SOFA score ranges), 0.67 (95% CI, 0.65-0.69) for Colorado (i.e., SOFA score ranges and comorbidities), and 0.64 (95% CI, 0.62-0.66) for raw SOFA scores (p<0.001, Figure 1D). In a sensitivity analysis, we imputed missing SOFA score components with normal (i.e., healthy) values to assess our exclusion of patients with these components missing in their clinical record. We included an additional 594 patients and found similar trends in AUROC, with 0.59 (95% CI, 0.57-0.61) for New York, 0.65 (95% CI, 0.64-0.67) for Colorado, and 0.61 (95% CI, 0.59-0.63) for raw SOFA scores (Figure S2A). To investigate what may drive differences in performance among CSC algorithms, we conducted sensitivity analyses examining the effect of comorbidities and the effect of how SOFA scores are grouped into ranges. First, we assessed how each component of Colorado's algorithm (SOFA scores and comorbidity scores) performs on its own in discriminating 28-day mortality (Figure 1E). The SOFA score component of Colorado's algorithm yielded an AUROC (95% CI) of 0.61 (0.59-0.63), and the comorbidity component alone yielded an AUROC of 0.64 (0.62-0.66) with p<0.001, compared to 0.67 (0.65-0.69) for Colorado's complete algorithm with both SOFA and comorbidity components. The elements of SOFA and comorbidities are not independent. For example, in Colorado's algorithm, chronic kidney disease (CKD) might be "counted twice" for some patients as a comorbidity in the Charlson Comorbidity Index and as a marker of organ failure in the SOFA score. 
In a sensitivity analysis, we excluded renal disease from the Charlson Comorbidity Index for the 332 patients with CKD or end-stage renal disease (ESRD). For 28-day mortality, the AUROC (95% CI) was 0.62 (0.55-0.68) if CKD/ESRD were excluded from comorbidity scoring, compared to 0.61 (0.54-0.67) with CKD/ESRD included (Figure S2B). If CKD/ESRD were excluded from the scoring of comorbidities, the AUROC for the entire cohort of 2,272 patients was 0.67 (0.65-0.69), which was unchanged from the original analysis. New York's and Colorado's algorithms differ in both their approach to comorbidities and how they group SOFA scores into ranges. Grouping SOFA scores into ranges is a common feature of state CSC algorithms, but the effect of grouping schemes on performance has not been studied empirically. We performed a sensitivity analysis to assess the effect of SOFA score grouping schemes on algorithm performance, for example, for SOFA scores in increments of two, with SOFA scores of 1 or 2 receiving a priority score of 1, SOFA scores of 3 or 4 receiving a priority score of 2, and so forth. The predictive accuracy of SOFA scores was insensitive to grouping in increments of 1, 2, or 3. For SOFA score increments of one (i.e., the ungrouped, raw SOFA algorithm), the AUROC was 0.61 (0.59-0.63); for increments of two, 0.63 (0.61-0.65); and for increments of three, 0.62 (0.60-0.64) (Figure S2C). To examine how prioritization algorithms may function in clinical scenarios, we simulated selecting one patient to receive scarce resources out of groups of two or five patients using a bootstrap method. For each group of patients, a "winner" with the "best" priority score (i.e., lowest priority point total) was selected, and the "winner's" 28-day outcome (survivor or deceased) was noted. The group was considered "tied" if two or more patients tied for the "best" (lowest) priority point total. 
We performed 100 iterations of a computational simulation in which we randomly selected 1,000 groups of two or five patients. We excluded patient groups in which all the patients had the same outcome (i.e., all survivors or all deceased), since we cannot assess if the algorithm correctly selects a patient with a better outcome if all the patients in that group shared the same outcome. For each simulation of 1,000 patient groups, we calculated the percentage of groups for which the algorithm chose a patient who survived, and we computed the percentage of groups in which the algorithm required a tie-breaking lottery. These simulations yielded distributions for the percentage of algorithm decisions that selected a survivor or required a lottery. The results suggested that algorithms struggle with selecting one patient from a larger group, as all algorithms had worse performance in the groups of five patients, compared to groups of two patients. First, we examined the frequency of patient groups with tied priority scores that required a tie-breaker, such as a lottery (Table 3, Column A). New York selected a patient without a lottery tie-breaker in 51% (95% CI, 47-55) of patient groups of two but selected a patient without a lottery tie-breaker in only 6% (4-7) of patient groups of five. That is, when selecting among a group of five patients, New York is almost a pure lottery, as 94% of the groups have tied priority scores requiring a lottery. For Colorado, the percentage of decisions made without requiring a tie-breaking lottery was 77% (95% CI, 74-80) and 58% (56-61) for patient groups of two and five, respectively. For our raw SOFA algorithm, the percentage of decisions made without lottery was 89% (95% CI, 87-92) and 78% (76-81) for groups of two and five, respectively (Table 3, Column A). In this simulation, we further examined those decisions which did not require a lottery tie-breaker. 
Among decisions not requiring a tie-breaker, we assessed whether the algorithm made the "correct" choice by prioritizing a patient with the better outcome (i.e., survival) (Table 3, Column B). New York chose a patient who survived in 72% (95% CI, 66-77) of patient groups of two and 64% (51-75) of groups of five. Colorado selected a surviving patient in 72% (95% CI, 68-76) and 74% (70-77) of decisions, for patient groups of two and five, respectively. The raw SOFA algorithm selected a surviving patient in 65% (62-70) and 66% (63-69) of decisions, for patient groups of two and five, respectively. Thus, for groups of five patients, the algorithm incorporating comorbidities (Colorado) had superior accuracy in selecting a patient with the better outcome than the algorithm (New York) that only considered SOFA score ranges. We next calculated the percentage of "correct decisions" (i.e., selecting a surviving patient) across all decisions, whether the patient was selected by priority score or required a tie-breaking lottery (Table 3, Column C). In patient groups of two, Colorado (67% correct decisions) had better overall performance than New York (61% correct) because, while Colorado and New York shared the same accuracy among non-lottery decisions, Colorado sent fewer decisions to a lottery, which selects the surviving patient only 50% of the time by chance (Table 3, Column C). For patient groups of five, Colorado (70% correct) continued to outperform New York (61% correct) with a combination of fewer lotteries and better accuracy in non-lottery decisions. Our simulations suggested that tie-breakers would be important and frequently employed in clinical practice. While no states use age as a categorical exclusion criterion, many initial guidelines included age categories as tie-breaker criteria. 
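For patient groups of two, the relationship among the three columns of Table 3 can be checked by hand. The following is a back-of-the-envelope sketch, not a calculation reported by the authors; the symbols A, B, and C are introduced here simply to stand for the Column A, B, and C proportions. Because every retained pair contains exactly one survivor, a lottery selects the survivor with probability 0.5, and the rounded point estimates quoted above reproduce the reported overall values.

```latex
% A = fraction of decisions made without a lottery (Column A)
% B = fraction correct among non-lottery decisions (Column B)
% C = fraction correct across all decisions (Column C)
\begin{align*}
C &\approx A \cdot B + (1 - A) \cdot 0.5 \\
C_{\text{New York}} &\approx 0.51 \times 0.72 + 0.49 \times 0.5 = 0.3672 + 0.2450 \approx 0.61 \\
C_{\text{Colorado}} &\approx 0.77 \times 0.72 + 0.23 \times 0.5 = 0.5544 + 0.1150 \approx 0.67
\end{align*}
```

The same shortcut does not carry over to groups of five, where the lottery's chance of selecting a survivor depends on how many of the tied patients survived.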
Many states have since moved away from specifying age as the main tie-breaker based on US Department of Health and Human Services guidance, instead considering age as part of individualized assessments when addressing tie-breakers. 24 We applied the tie-breaker based on age categories in Colorado's guidelines to all three algorithms. Adding age as a tie-breaker improved algorithm performance (Table 3). With age as the tie-breaker, New York selected a patient without a lottery in 90% (95% CI, 87-93) of decisions in patient groups of two and 68% (65-71) of decisions in groups of five, compared to 51% and 6%, respectively, without an age-based tie-breaker (Table 3, Column A). Of the decisions that did not require a lottery, New York chose the surviving patient in 70% (66-75) of pairs and 73% (69-76) of the groups of five, compared to 72% and 64%, respectively, without an age-based tie-breaker (Table 3, Column B). With an age tie-breaker, Colorado selected a patient without lottery in 93% (95% CI, 91-96) and 83% (80-85) for patient groups of two and five, respectively (increased from 77% and 58% without age tie-breakers) (Table 3, Column A). In non-lottery decisions, Colorado had similar accuracy in choosing the surviving patient with or without age tie-breakers (Table 3, Column B). Similar trends as Colorado were seen for the hypothetical raw SOFA score algorithm. For New York, the addition of age tie-breakers improved overall performance in selecting the "correct" patient from 61% to 71% (for patient groups of five, Table 3, Column C). For Colorado and raw SOFA, the overall performance was minimally improved by only 1-2%. In summary, the key effect of an age tie-breaker is to increase the percentage of decisions made without lottery without altering the percentage of "correct" choices. 
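The small-group simulation, with and without the age tie-breaker, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation (the paper reports using SPSS and R): the function and field names are hypothetical, patients within a group are drawn without replacement, same-outcome groups are discarded after drawing so the denominators are the retained groups, and only the life-cycle age groupings from Table 1 are used as the tie-breaker, omitting Colorado's other tie-breaking categories.

```python
import random


def life_cycle_group(age):
    """Life-cycle groupings from Table 1 (1 = highest priority)."""
    if age < 50:
        return 1
    if age < 60:
        return 2
    if age < 70:
        return 3
    if age < 80:
        return 4
    return 5


def simulate_groups(patients, group_size, n_groups=1000, use_age=False, rng=None):
    """One simulation run over randomly drawn patient groups.

    `patients` is a list of (priority_points, age, survived) tuples.
    Returns the fractions corresponding to Table 3, Columns A, B and C.
    """
    rng = rng or random.Random()
    kept = no_lottery = correct_no_lottery = correct_all = 0
    for _ in range(n_groups):
        # ASSUMPTION: patients within a group are sampled without replacement.
        group = rng.sample(patients, group_size)
        if len({survived for _, _, survived in group}) == 1:
            continue  # all survivors or all deceased: group is excluded
        kept += 1
        best = min(points for points, _, _ in group)
        tied = [p for p in group if p[0] == best]
        if use_age and len(tied) > 1:
            youngest = min(life_cycle_group(age) for _, age, _ in tied)
            tied = [p for p in tied if life_cycle_group(p[1]) == youngest]
        if len(tied) == 1:
            no_lottery += 1
            correct_no_lottery += int(tied[0][2])
            correct_all += int(tied[0][2])
        else:
            # Remaining ties are broken by lottery.
            correct_all += int(rng.choice(tied)[2])
    if kept == 0:
        raise ValueError("no group with mixed outcomes was drawn")
    return {
        "no_lottery": no_lottery / kept,  # Column A
        "correct_no_lottery": (correct_no_lottery / no_lottery) if no_lottery else None,  # Column B
        "correct_all": correct_all / kept,  # Column C
    }
```

Calling `simulate_groups` 100 times and collecting the returned fractions would give distributions analogous to those summarized in Table 3.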
Incorporating comorbidities has the potential to worsen algorithm performance for Black patients
To examine the differences in algorithm performance by self-reported race and ethnicity, 1,468 patients were included in the analysis, with 867 (59%) White and 601 (41%) Black patients. For ethnicity, 1,956 patients were included, with 588 (30%) Hispanic/Latino and 1,368 (70%) non-Hispanic/Latino. For Colorado, the AUROC for predicting 28-day mortality was 0.62 (95% CI, 0.57-0.66) for Black patients and 0.68 (0.65-0.72) for White patients (p<0.03). There were no statistically significant differences in AUROC for Colorado by ethnicity or for New York by race or ethnicity (Figure S3). To assess the clinical meaning of Colorado's modest differences in performance, we turned to our simulation of selecting a patient out of a small group. Colorado's algorithm consistently performed better for the subcohort of White patients than for Black patients. In the simulation of selecting one patient from a group of five patients, Colorado selected the patient with the better outcome in 71% of groups in the White subcohort, but only 63% in the Black subcohort (p<0.01) (Table S4, Column C). In contrast, New York selected the patient with the better outcome at similar rates for White and Black subcohorts (61% and 60%, respectively). Similar results were seen in groups of two and with an age tie-breaker. The hypothetical raw SOFA algorithm selected the "correct" patient more frequently for the White compared to Black subcohort (65% and 60%, respectively). New York's SOFA ranges reduced the race-dependent effects seen with raw (ungrouped) SOFA scores. 
However, even though Colorado's SOFA score groupings resemble New York's,
In this multicenter, nationally representative cohort study of critically ill patients with COVID-19, we found that both New York's (SOFA score groups) and Colorado's (SOFA score groups and comorbidities) algorithms had modest accuracy in discriminating 28-day mortality. In Colorado's algorithm, the addition of comorbidities modestly improved performance over the SOFA score alone. CSC algorithms greatly varied in the frequency of ties, which ranged from 11% to 94% depending on the scenario and algorithm (Table 3). Frequent ties are not necessarily a positive or negative feature. Some ethicists deem lotteries the "most fair" method. However, when a state is selecting a CSC algorithm, the frequency of tied priority scores is an important performance characteristic to assess. Our results for the SOFA score component alone fall within the broad range seen in studies of SOFA scores for triage of critically ill patients without COVID-19. Prior studies reported AUROCs of 0.55-0.88 for the prediction of outcomes by SOFA scores. 2,25-29 Differences among studies were likely driven both by cohort characteristics and the design of triage algorithms. However, states have little empirical guidance on how to design algorithms. For example, a key design decision is how to group SOFA scores in ranges, if at all. In our sensitivity analysis, SOFA score grouping made little difference to predictive accuracy for 28-day mortality, although using raw (ungrouped) SOFA scores reduced ties. Further study is needed to better understand grouping strategies in patient cohorts with different distributions of SOFA scores and outcomes. 
We also found that the average SOFA scores for both New York's and Colorado's algorithms fall within the highest priority category (indicating the highest likelihood of survival), which may contribute to the modest performance of algorithms in discriminating outcomes. Our study and prior literature raise the question of whether the use of SOFA scores in CSC guidelines should be reconsidered. 9, 25, 27 Possible substitutes for SOFA scores may include blood laboratory values associated with outcomes in COVID-19 like C-reactive protein (CRP) or lactate levels. [30] [31] [32] A challenge is finding metrics that work for a mixed population with diagnoses ranging from sepsis to respiratory failure. If activated in this study cohort, the CSC algorithms would have denied scarce resources to many patients who would have survived and allocated resources to many who would have died. In the clinical scenario of selecting from small groups of patients, CSC algorithms "correctly" selected the patient with the better outcome in 65% to 74% of decisions. Whether 70% is an acceptable success rate depends on a variety of ethical and practical considerations. A state may decide against incorporating comorbidities, despite modestly worse overall performance, for simplicity or to avoid the potential for exacerbating racial disparities, although the possible relationship of race, ethnicity and CSC algorithm performance requires further study. [33] [34] [35] [36] [37] However, another state may include comorbidities since a modest improvement in performance may result in a meaningful number of lives saved when applied to many. It is vital that states empirically test triage algorithms to quantify whether an algorithm fulfills the ethical principle of maximizing lives saved and reaches acceptable thresholds for "real world" performance set by medical, lay and other communities. 
We have offered a framework for conducting such tests to ensure that CSC algorithms achieve the ethical principles they are designed to operationalize. Our study has several limitations. First, the study is limited to patients with COVID-19. CSC algorithms may be more or less predictive in other diseases, thus systematically advantaging or disadvantaging those with COVID-19. Second, the study may not generalize to patients who were intubated several days after ICU admission or the less common situation of intubation more than 24 hours before ICU admission. More generally, this study did not distinguish patients who deteriorated early in their hospital course (e.g., hospital day 1) from those who deteriorated later in their hospital course after several days of non-critical illness. Third, due to data limitations, we used a modified version of the SOFA score, approximating two of the components (cardiovascular and central nervous system). Fourth, although CSC were not activated at our study sites, the study cannot account for how the severity of the COVID-19 pandemic and individual illnesses may have influenced the decisions of individual clinicians regarding intubation and ICU triage; nor can our study account for changes to clinical practice, such as more intensive palliative care consultation, during different phases of the COVID-19 pandemic. Fifth, it is possible that scoring systems may perform differently now than in the spring of 2020 after the introduction of new therapies and improvement in outcomes. 38 Sixth, we examined two state guidelines representing the most common elements in state algorithms, but there are differences by state that may affect performance. Finally, we did not assess outcomes beyond 28 days, though our prior study found that the vast majority of deaths that occur among critically ill patients with COVID-19 occur in the first 28 days following ICU admission. 
21
a The New York Algorithm exclusion criteria include the following: 1) unwitnessed cardiac arrest, recurrent arrest without hemodynamic stability, arrest unresponsive to standard interventions and measures; trauma-related arrest, 2) irreversible age-specific hypotension unresponsive to fluid resuscitation and vasopressor therapy, 3) traumatic brain injury with no motor response to painful stimulus (i.e., best motor response = 1), 4) severe burns where predicted survival ≤ 10% even with unlimited aggressive therapy, and 5) any other conditions resulting in immediate or near-immediate mortality even with aggressive therapy. None of the patients in this cohort met these exclusion criteria.
b Original and modified Comorbidity Index provided in Table S3.
c Life Cycle groupings (age, years) for Colorado also used for Raw SOFA model: 0-49 = 1 (Highest Priority) | 50-59 = 2 | 60-69 = 3 | 70-79 = 4 | 80+ = 5 (Lowest Priority)
Table 3. CSC algorithm performance in small group comparisons a. A-C. Triage decisions by CSC algorithms in a simulation of 1,000 random groups of two or five patients. Column A. Percent of decisions that did not require a tie-breaker (i.e., two or more patients not tied for the "best" (lowest) priority score). Column B. Among the decisions not requiring tie-breakers, percent of decisions in which the algorithm selected a patient with the better outcome (i.e., survival). Column C. Percent of correct selections (i.e., selecting a surviving patient) across all decisions (i.e., all decisions regardless of whether selected by priority score or requiring a tie-breaking lottery). a Unpaired t-tests were conducted to compare all algorithms (with and without age as a tie-breaker) to each other. Nearly all comparisons were significant at p<0.01. 
The only nonsignificant comparisons were New York versus Colorado for groups of two in Column B, New York + age versus Colorado + age for groups of two and groups of five in Column B, and New York + age versus Colorado + age for groups of two and groups of five in Column C. *Indicates the algorithm that is closest to the state guidelines. New York's algorithm as written in the state guidelines does not utilize a tie-breaker, while Colorado's algorithm does.
No new reagents or materials were generated as part of this study. Patient data reviewed in this study are not publicly available due to restrictions on patient privacy and data sharing. Individual patient-level data are not currently available because there are individual data use agreements with each of the 67 participating STOP-COVID institutions that do not permit sharing of individual patient data with outside entities. Summary data from STOP-COVID are publicly available in the prior publications, such as the following: Gupta
This is a multicenter, retrospective cohort study utilizing the previously published Study of the Treatment and Outcomes in Critically Ill Patients with COVID-19 (STOP-COVID) cohort, with inclusion, exclusion and data collection previously described in detail. 21 This study enrolled 4,717 consecutive adult patients with laboratory-confirmed COVID-19 admitted to ICUs at 68 hospitals across the United States from March 4 to June 17, 2020 (Table S1). 21 Inclusion criteria for the current manuscript were intubation on ICU day 1 and availability of data required to calculate SOFA scores within the first 48 hours of ICU admission. 
The SOFA score is a tool to assess the level of dysfunction of six organ systems, including respiratory function (ratio of the partial pressure of arterial oxygen to the fraction of inspired oxygen [PaO2 / FiO2]), coagulation (platelet count), liver function (total bilirubin), neurological function (Glasgow Coma Scale), cardiovascular function (number and dose of vasopressors), and renal function (serum creatinine and urine output). Colorado's algorithm altered the SOFA respiratory score to use either the pulse oximetry measurement of percent oxygen saturation (SpO2) or the standard arterial blood gas measurement of the partial pressure of arterial oxygen (PaO2). Of the STOP-COVID cohort, a total of 2,445 patients were excluded for lack of intubation, intubation later than ICU day 1, or lack of data to calculate SOFA scores (Figure S1). Of 2,866 patients intubated on ICU day 1, 594 patients (20%) were excluded due to insufficient clinical data to calculate the SOFA score (Table S2). The analysis by race and ethnicity was restricted to patients who self-identified as Black or White, as other self-identified categories had low numbers of patients (i.e., Asian, American Indian / Alaska Native, Native Hawaiian / Other Pacific Islander, More than One Race). For each patient, CSC priority points were calculated according to two state algorithms (New York and Colorado) and a hypothetical algorithm of raw SOFA scores not grouped into ranges (Table 1). New York's algorithm grouped raw SOFA scores into three ranges. Colorado's algorithm incorporated two components: raw SOFA scores grouped into four ranges and comorbidities according to the Charlson Comorbidity Index. For this study, we adapted the Charlson Comorbidity Index to comorbidity data available in the STOP-COVID database (Tables S3-S4), which we refer to as the "modified" Colorado model. The algorithms are described further in the Results section and in Table 1. 
The primary outcome was 28-day in-hospital mortality. Patients discharged alive from the hospital prior to 28 days were considered to be alive at 28 days. The validity of this assumption was verified in a subset of patients, as described elsewhere. 21 Data for the STOP-COVID cohort were collected by manual review of electronic health records as described previously. 21 Demographic data collected included age, gender, self-reported race and ethnicity, and comorbidities. Clinical data were collected at the time of ICU admission and included measurements of hemodynamics and oxygenation, respiratory and vasopressor support, and laboratory values. SOFA scores were calculated using data from ICU Day 1. Each ICU day is defined as a 24-hour period, from midnight to midnight. ICU Day 1 refers to the 24-hour period from the midnight prior to ICU admission to the midnight after ICU admission. If more than one lab value was available, the first value (i.e., first value recorded after midnight) was taken as the value for the 24-hour time period. If unavailable, data from ICU Day 2 were used. If no value was available on either ICU Day 1 or 2, the following approach to missing data was followed: Patients were excluded from the analysis if they had missing data for the following components: PaO2 (161 patients), FiO2 (86 patients), platelets (37 patients), bilirubin (176 patients), altered mental status (310 patients). Some patients had multiple missing values. No patients were excluded for a missing creatinine value. A total of 594 out of 2,866 patients (20%) intubated on Day 1 of ICU admission were excluded based on lack of data availability. The SOFA score 9 was adapted to accommodate the STOP-COVID database (Table S2). The scoring of the SOFA cardiovascular component was adapted to the STOP-COVID registry, which did not collect data on vasopressor dosage, only the number of vasopressors/inotropes administered each day. U.S. 
intensivists typically choose norepinephrine as the first vasopressor, so initiation of a vasopressor was scored as 3, corresponding to the scoring of norepinephrine initiation in standard SOFA scoring. The addition of a second vasopressor was scored as 4, since a second vasopressor is typically added only when norepinephrine dosage is > 0.1 µg/kg/min. These adaptations eliminated the cardiovascular score of 1 (mean arterial pressure < 70 mm Hg), and we cannot exclude exceptions to the most common clinical practice in study sites. For the central nervous system (CNS) component, the Glasgow Coma Scale was approximated based on whether "altered mental status" (AMS) was indicated on the most recent physical exam prior to intubation. A score of "1" indicates that the patient had AMS, while a score of "0" indicates that the patient did not have AMS. A total of 310 patients who were marked as "data not available" were excluded. This adaptation lacks the range of CNS scoring in standard SOFA scoring. Normality was assessed using the Shapiro-Wilk test. Descriptive statistics were reported as mean (standard deviation) for normal distributions or median (interquartile range) for non-normal distributions. Standard error was calculated using the method described by DeLong et al, 18 and confidence intervals were calculated with the exact binomial test. For continuous variables, unpaired Student's t tests (normal distribution) or Mann-Whitney U tests (non-normal distribution) were used for two-group comparisons. Area under the receiver operating characteristic (AUROC) curves were calculated to assess the accuracy of each CSC algorithm in discriminating 28-day in-hospital mortality. AUROCs were compared according to the method of DeLong et al. 40 To simulate a clinical scenario, we analyzed algorithm performance in small groups of two or five patients drawn at random from either the entire cohort (Table 3) or subcohorts defined by race (Table S5). 
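For readers unfamiliar with the AUROC used in the statistical analysis above, it can be read as the probability that a randomly chosen patient who died carries a higher (worse) priority score than a randomly chosen survivor, with tied scores counted as one half. The sketch below illustrates that definition only; the function name is hypothetical, it is not the authors' code (the analysis was run in SPSS and R), and it does not implement the DeLong standard error or AUROC comparison.

```python
def auroc(priority_scores, died):
    """AUROC of priority scores for discriminating mortality.

    Computed as the fraction of (deceased, survivor) pairs in which the
    deceased patient has the higher priority score; ties count as 0.5.
    """
    deceased = [s for s, d in zip(priority_scores, died) if d]
    survivors = [s for s, d in zip(priority_scores, died) if not d]
    if not deceased or not survivors:
        raise ValueError("need at least one deceased and one surviving patient")
    concordant = 0.0
    for s_dead in deceased:
        for s_alive in survivors:
            if s_dead > s_alive:
                concordant += 1.0
            elif s_dead == s_alive:
                concordant += 0.5
    return concordant / (len(deceased) * len(survivors))
```

This pairwise form makes explicit why coarse score ranges with many ties pull the AUROC toward 0.5: every tied pair contributes only half a point regardless of outcome.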
We performed 100 iterations of a computational simulation in which we randomly selected 1,000 groups of two or five patients. Patient groups in which all the patients had the same outcome (i.e., all survivors or all deceased) were excluded, since we cannot assess whether the algorithm correctly selects a patient with a better outcome if all the patients in that group shared the same outcome.

Table 3 Column A: For each simulation of 1,000 patient groups, we calculated the percent of groups for which the algorithm had a single patient with the "best" (lowest) priority score, so that a tie-breaker, such as a lottery, was not required.
Table 3 Column B: Among the patient groups that did not require a tie-breaker, we assessed whether the algorithm made a "correct decision," as defined by the selection of a surviving patient.
Table 3 Column C: We calculated algorithm performance in making "correct decisions" (i.e., selecting a surviving patient) across all groups, that is, the groups in Column B (no tie-breaker needed) and the groups that required a tie-breaker.

We further examined the effect of adding age as the first tie-breaker before the lottery. Each simulation of 1,000 patient groups was iterated 100 times to generate a distribution. An unpaired t-test was used to test for significant differences between the distributions. Statistical analysis was conducted in SPSS Statistics Version 25 (IBM) and R Version 3.6.1 (The R Project).

The New York algorithm exclusion criteria include the following: 1) unwitnessed cardiac arrest, recurrent arrest without hemodynamic stability, arrest unresponsive to standard interventions and measures, or trauma-related arrest; 2) irreversible age-specific hypotension unresponsive to fluid resuscitation and vasopressor therapy; 3) traumatic brain injury with no motor response to painful stimulus (i.e., best motor response = 1); 4) severe burns where predicted survival is ≤ 10% even with unlimited aggressive therapy; and 5) any other conditions resulting in immediate or near-immediate mortality even with aggressive therapy. None of the patients in this cohort met these exclusion criteria.
b Original and modified Comorbidity Index provided in Table S3.

Table 3. CSC algorithm performance in small group comparisons a. A-C. Triage decisions by CSC algorithms in a simulation of 1,000 random groups of two or five patients.
Column A. Percent of decisions that did not require a tie-breaker (i.e., two or more patients not tied for the "best" (lowest) priority score).
Column B. Among the decisions not requiring tie-breakers, percent of decisions in which the algorithm selected a patient with the better outcome (i.e., survival).
Column C. Percent of correct selections (i.e., selecting a surviving patient) across all decisions (i.e., all decisions regardless of whether selected by priority score or requiring a tie-breaking lottery).

Table of Contents:
Table S1. STOP-COVID Investigators and Participating Sites
Table S2. SOFA ("sSOFA") score calculation
Table S3. Approach to Comorbidity Scoring in Colorado's algorithm
Table S4. CSC algorithm performance in groups of two and five comparisons by race
Figure S1. Study Cohort
Figure S2. Sensitivity analysis for the association of priority scores or categories with 28-day mortality
Figure S3. Performance of CSC algorithms according to race or ethnicity

Table S2. SOFA ("sSOFA") score calculation. This study adapted standard SOFA scoring to the data in the clinical registry, as highlighted in Table S2A (grey). Since study sites typically utilized norepinephrine as the first vasopressor, the use of one vasopressor was assigned a score of 3, to correspond to the scoring for initiation of norepinephrine in standard SOFA scoring.
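The adapted cardiovascular and CNS components described in the Methods can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and two behaviors are assumptions not stated in the text, namely that no vasopressor use maps to a score of 0 and that three or more vasopressors are scored the same as two.

```python
from typing import Optional


def cardiovascular_score(n_vasopressors: int) -> int:
    """Adapted SOFA cardiovascular score from the daily vasopressor count.

    One vasopressor is scored 3 (treated as norepinephrine initiation) and a
    second vasopressor is scored 4, as described in the text. Scoring no
    vasopressor as 0, and three or more as 4, are assumptions.
    """
    if n_vasopressors < 0:
        raise ValueError("vasopressor count cannot be negative")
    if n_vasopressors == 0:
        return 0  # assumption: no vasopressor support scores 0
    if n_vasopressors == 1:
        return 3
    return 4


def cns_score(altered_mental_status: Optional[bool]) -> Optional[int]:
    """Adapted CNS score: 1 if AMS was documented, 0 if not.

    None stands for "data not available"; such patients were excluded from
    the analysis, so no score is returned for them.
    """
    if altered_mental_status is None:
        return None
    return 1 if altered_mental_status else 0
```

For example, a patient on one vasopressor with documented AMS would contribute `cardiovascular_score(1) + cns_score(True)`, that is, 3 + 1 points, from these two components.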
The dataset allowed scoring of the presence or absence of altered mental status (AMS) but not the Glasgow Coma Scale (GCS).

Table S4. CSC algorithm performance in groups of two or five patients by race. In subcohorts of White or Black patients, the New York (NY) (SOFA score grouping only), modified Colorado (CO) (SOFA score groupings with comorbidities scoring), and a hypothetical algorithm of raw SOFA scores without grouping were examined in a simulation of 1,000 random groups of two or five patients. Algorithms' "decisions" in selecting a "winning" patient or requiring a lottery tie-breaker were assessed.
Column A. Percent of decisions that did not require tie-breakers (i.e., two or more patients not tied for the "best" (lowest) priority score).
Column B. Among the decisions not requiring tie-breakers, percent of decisions in which the algorithm selected a patient with a better outcome (i.e., survival).
Column C. Percent of correct selections (i.e., selecting a surviving patient) across all decisions (i.e., all decisions regardless of whether selected by priority score or tie-breaking lottery).
A. Percent of decisions not needing a lottery tie-breaker

Figure S2. A-C: AUROC curves for discrimination of 28-day mortality by priority scores are shown for the following sensitivity analyses:
A. The cohort was expanded to a total of 2,866 patients, which includes the 594 patients who were excluded due to insufficient data to calculate SOFA scores in the original analysis (Figure S1) and the 2,272 patients included in the original analysis. The three algorithms (New York, Colorado, Raw SOFA Scores) were applied to the expanded cohort.
B. In the modified Colorado algorithm, CKD/ESRD could be "counted double," by contributing to both the SOFA scoring and the comorbidities scoring. The modified Colorado algorithm was compared to a version excluding CKD/ESRD from comorbidity scoring.
C. Hypothetical algorithms grouping SOFA scores in ranges of two or ranges of three were applied to the study cohort to generate priority points. The hypothetical algorithm of raw (ungrouped) SOFA scores was compared to the groupings of SOFA scores in ranges of two and ranges of three.

Figure S3. Performance of CSC algorithms according to race or ethnicity. The state CSC or hypothetical raw SOFA score algorithms were applied to subcohorts defined by race or ethnicity to generate priority scores. The accuracy of priority scores in predicting 28-day mortality after ICU admission and intubation was assessed by AUROC curve. There were no statistically significant differences in AUROC for each algorithm across race or ethnicity (p > 0.05).

Variation in ventilator allocation guidelines by US state during the coronavirus disease 2019 pandemic: a systematic review
US State Government Crisis Standards of Care Guidelines: Implications for Patients With Cancer
Allocation of Scarce Resources in a Pandemic: A Systematic Review of US State Crisis Standards of Care Documents
Crisis standards of care in a pandemic: navigating the ethical, clinical, psychological and policy-making maelstrom. International Journal for Quality in Health Care
Fair Allocation of Scarce Medical Resources in the Time of Covid-19
A Framework for Rationing Ventilators and Critical Care Beds During the COVID-19 Pandemic
Adult ICU Triage During the Coronavirus Disease 2019 Pandemic: Who Will Live and Who Will Die? Recommendations to Improve Survival. Critical Care Medicine
At the Top of the Covid-19 Curve, How Do Hospitals Decide Who Gets Treatment?
In-hospital cardiac arrest in critically ill patients with covid-19: multicenter cohort study. BMJ
The toughest triage - allocating ventilators in a pandemic
Comparison of 2 Triage Scoring Guidelines for Allocation of Mechanical Ventilators
Discriminant Accuracy of the SOFA Score for Determining the Probable Mortality of Patients With Pneumonia Requiring Mechanical Ventilation
Race, socioeconomic status and health: complexities, ongoing challenges and research opportunities
Racial and ethnic disparities in health care access and utilization under the Affordable Care Act
Structural Racism and Supporting Black Lives - The Role of Health Professionals
What We Don't Talk About When We Talk About Preventing Type 2 Diabetes. Erratum in: JAMA Intern Med
Racial differences in hypertension: implications for high blood pressure management
access and disparities in kidney disease: chronic kidney disease hotspots and progress one step at a time
NY guidelines for ventilators
Colorado guidelines for scarce resources
STOP-COVID Investigators. Factors associated with death in critically ill patients with coronavirus disease 2019 in the US
The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine
A new method of classifying prognostic comorbidity in longitudinal studies: development and validation
Evaluation of SOFA-based models for predicting mortality in the ICU: A systematic review
Serial evaluation of the SOFA score to predict outcome in critically ill patients
Validation of SOFA Score in Critically Ill Patients with COVID-19
An assessment of the validity of SOFA score based triage in H1N1 critically ill patients during an influenza pandemic
A retrospective cohort pilot study to evaluate a triage tool for use in a pandemic
Prognostic factors for severity and mortality in patients infected with COVID-19: A systematic review
Inflammatory Biomarker Trends Predict Respiratory Decline in COVID-19 Patients
Protocol for assessing and predicting acute respiratory decline in hospitalized patients
Crisis standards of care in the USA: a systematic review and implications for equity amidst COVID-19. Journal of Racial and Ethnic Health Disparities
Mitigating Inequities and Saving Lives with ICU Triage During the COVID-19 Pandemic
The Harm Of A Colorblind Allocation Of Scarce Resources
Respecting Disability Rights - Toward Improved Crisis Standards of Care
Inequity in Crisis Standards of Care
Improving Survival of Critical Care Patients With Coronavirus Disease 2019 in England: A National Cohort Study
Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach

1. Crisis Standards of Care (CSC) guidelines have poor prediction of 28-day mortality
2. Consideration of comorbidities modestly improves guideline performance

show that crisis standards of care (CSC) guidelines, used to allocate scarce medical resources, poorly discriminate 28-day mortality and result in frequently tied priority scores. The authors present a framework for testing CSC guidelines to ensure they meet their stated ethical goals Beth