key: cord-1046558-t5cdwpo4
authors: Gershengorn, Hayley B.; Patel, Samira; Shukla, Bhavarth; Warde, Prem R.; Soorus, Shane M.; Holt, Gregory E.; Kett, Daniel H.; Parekh, Dipen J.; Ferreira, Tanira
title: Predictive Value of Sequential Organ Failure Assessment Score across Patients with and without COVID-19 Infection
date: 2022-05-01
journal: Annals of the American Thoracic Society
DOI: 10.1513/annalsats.202106-680oc
sha: 1e297ec7a9bf032d697c928892348893ac580e40
doc_id: 1046558
cord_uid: t5cdwpo4

RATIONALE: Sequential organ failure assessment (SOFA) scores are commonly used in crisis standards of care policies to assist in resource allocation. The relative predictive value of SOFA by coronavirus disease (COVID-19) infection status and among racial and ethnic subgroups within patients infected with COVID-19 is unknown. OBJECTIVES: To evaluate the accuracy and calibration of SOFA in predicting hospital mortality by COVID-19 infection status and across racial and ethnic subgroups. METHODS: We performed a retrospective cohort study of adult admissions to the University of Miami Hospital and Clinics inpatient wards (July 1, 2020–April 1, 2021). We primarily considered maximum SOFA within 48 hours of hospitalization. We assessed accuracy using the area under the receiver operating characteristic curve (AUROC) and created calibration belts. Considered subgroups were defined by COVID-19 infection status (by severe acute respiratory syndrome coronavirus 2 polymerase chain reaction testing) and prevalent racial and ethnic minorities. Comparisons across subgroups were made with DeLong testing for discriminative accuracy and visualization of calibration belts. RESULTS: Our primary cohort consisted of 20,045 hospitalizations, of which 1,894 (9.5%) were COVID-19 positive. SOFA was similarly accurate for COVID-19–positive (AUROC, 0.835) and COVID-19–negative (AUROC, 0.810; P = 0.15) admissions but was slightly better calibrated in patients who were positive for COVID-19. For those with critical illness, maximum SOFA score accuracy at critical illness onset also did not differ by COVID-19 status (AUROC, COVID-19 positive vs. negative: intensive care unit admissions, 0.751 vs. 0.775; P = 0.46; mechanically ventilated, 0.713 vs. 0.792, P = 0.13), and calibration was again better for patients positive for COVID-19. 
Among patients with COVID-19, SOFA accuracy was similar between the non-Hispanic White population (AUROC, 0.894) and racial and ethnic minorities (Hispanic White population: AUROC, 0.824 [P vs. non-Hispanic White = 0.05]; non-Hispanic Black population: AUROC, 0.800 [P = 0.12]; Hispanic Black population: AUROC, 0.948 [P = 0.31]). This similar accuracy was also found for those without COVID-19 (non-Hispanic White population: AUROC, 0.829; Hispanic White population: AUROC, 0.811 [P = 0.37]; Hispanic Black population: AUROC, 0.828 [P = 0.97]; non-Hispanic Black population: AUROC, 0.867 [P = 0.46]). SOFA was well calibrated for all racial and ethnic groups with COVID-19 but estimated mortality more variably and performed less well across races and ethnicities without COVID-19. CONCLUSIONS: SOFA accuracy does not differ by COVID-19 status and is similar among racial and ethnic groups both with and without COVID-19. Calibration is better for COVID-19–infected patients and, among those without COVID-19, varies by race and ethnicity.

The importance of accurate predictions of short-term mortality in the setting of acute illness has become clear during the coronavirus disease (COVID-19) pandemic. Crisis standards of care (CSC) policies create frameworks to allocate life-saving resources when demand exceeds supply. Most such policies are based, at least in part, on expected short-term patient survival, with the sequential organ failure assessment (SOFA) score (1) commonly used for mortality predictions (2-4).

The accuracy of SOFA for predicting mortality in the setting of COVID-19 has been found to vary. In Wuhan, China, early in the pandemic, SOFA was shown to have a poor predictive accuracy (area under the receiver operating characteristic curve [AUROC], 0.69) for hospitalized patients with COVID-19 (5), but to be excellent for critically ill patients with COVID-19 (AUROC, 0.89) (6, 7) . In the United States, SOFA scores had excellent predictive accuracy for hospitalized patients with COVID-19 (AUROC, 0.85) and performed even better in a larger cohort inclusive of patients without COVID-19 (AUROC, 0.90) (8) . Whether the accuracy of SOFA differs for patients based on COVID-19 status is unknown.

In diverse U.S. populations of hospitalized patients without COVID-19, SOFA accuracy has been demonstrated to vary little across racial and ethnic groups (9, 10). However, in one of these studies (10), SOFA was shown to be miscalibrated, overestimating mortality among individuals of Black race and underestimating it for White people. Whether this differential miscalibration exists among individuals with COVID-19 is uncertain.

Recognizing that use of any CSC policy during the COVID-19 pandemic will necessarily affect patients of all racial and ethnic groups both with and without COVID-19, we sought to directly compare the predictive accuracy and calibration of SOFA across these groups. We hypothesized that SOFA would perform similarly for patients with and without COVID-19 but would perform less well for persons of non-White race and/or Hispanic ethnicity independent of COVID-19 status.

We performed a retrospective cohort study of admissions to the University of Miami Hospital and Clinics inpatient wards from July 1, 2020, to April 1, 2021. This hospital system consists of three inpatient facilities: a general tertiary care academic hospital (550 beds), a cancer-specialty hospital (40 beds), and an ophthalmology-care hospital (75 beds).

Our primary cohort consisted of all admissions who were discharged by April 1, 2021 (as patients discharged after that date would be missing hospital mortality data). Admissions were excluded if the patient was less than 18 years old or had missing SOFA data. We considered two secondary cohorts: 1) patients admitted to an ICU; and 2) patients who received invasive mechanical ventilation (MV).

Our exposure was maximum SOFA score. Starting on June 15, 2020, we implemented an automated SOFA score calculation into our electronic health record (Epic). Total SOFA scores (range 0 [best]-24 [worst]) as well as each organ system-based component (respiratory, cardiovascular, renal, liver, hematological, and neurological; range 0 [best]-4 [worst] each) were calculated and recorded hourly for all admissions throughout their hospital stay (Table E1 in the online supplement). Owing to incomplete documentation of urine output, the renal component of the SOFA score was based solely on creatinine; all patients with an active order for dialysis were given a renal SOFA score of 4 (worst value). When arterial blood gas testing was not available, conversion of oxygen saturation by pulse oximetry to an estimated partial pressure of oxygen in the arterial blood was used (11) .
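The creatinine-only renal rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' Epic implementation: the creatinine cut points are the standard SOFA thresholds in mg/dL (the study's own definitions are in its Table E1, which is not reproduced here), and the function names and the `dialysis_order` flag are hypothetical.

```python
def renal_sofa(creatinine_mg_dl, dialysis_order=False):
    """Renal SOFA component (0 best, 4 worst) based on creatinine alone.

    Assumption: standard SOFA creatinine thresholds in mg/dL. Urine output
    is deliberately ignored, mirroring the creatinine-only rule in the text.
    """
    if dialysis_order:
        # An active dialysis order is assigned the worst renal score.
        return 4
    if creatinine_mg_dl < 1.2:
        return 0
    if creatinine_mg_dl < 2.0:
        return 1
    if creatinine_mg_dl < 3.5:
        return 2
    if creatinine_mg_dl < 5.0:
        return 3
    return 4


def total_sofa(components):
    """Sum six organ-system components (each 0-4) into a total of 0-24."""
    expected = {"respiratory", "cardiovascular", "renal",
                "liver", "hematological", "neurological"}
    if set(components) != expected:
        raise ValueError("expected exactly the six SOFA components")
    for name, value in components.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{name} component out of range: {value}")
    return sum(components.values())
```

For example, `renal_sofa(1.0, dialysis_order=True)` returns 4 regardless of the creatinine value, which is the behavior the text describes for patients with an active dialysis order.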

Our primary exposure was maximum SOFA score within 48 hours following hospitalization for the full cohort. We evaluated two additional exposures for sensitivity analyses, within 24 hours and within 72 hours of hospitalization. For the subcohort of ICU patients, we considered maximum SOFA within 48 hours surrounding ICU admission (defined as within 24 hours before and 24 hours after ICU admission), recognizing that early ICU care may change illness trajectory such that need for invasive mechanical ventilation, a main resource for potential allocation, may be altered. As a post hoc sensitivity analysis, we also considered maximum SOFA in solely the 24 hours before ICU admission. For the subcohort of MV patients, we considered maximum SOFA within 48 hours prior to MV initiation (12) .
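The exposure windows above differ only in their start and end times relative to a reference event. A minimal sketch, assuming hourly scores are stored as `(hour, score)` pairs with hours counted from hospital admission (the data layout and function name are hypothetical, not taken from the study):

```python
def max_sofa_in_window(hourly_scores, start_hour, end_hour):
    """Return the maximum SOFA recorded in [start_hour, end_hour].

    hourly_scores: iterable of (hour_since_hospital_admission, total_sofa).
    Returns None when no score falls inside the window.
    """
    in_window = [score for hour, score in hourly_scores
                 if start_hour <= hour <= end_hour]
    return max(in_window) if in_window else None


# Illustrative (made-up) trajectory for one admission.
scores = [(0, 2), (10, 5), (30, 7), (50, 9), (60, 11)]

primary = max_sofa_in_window(scores, 0, 48)       # first 48 h of hospitalization
sens_24h = max_sofa_in_window(scores, 0, 24)      # 24-h sensitivity window
icu_hour = 40                                     # hypothetical ICU admission time
icu = max_sofa_in_window(scores, icu_hour - 24, icu_hour + 24)
mv_hour = 55                                      # hypothetical MV initiation time
mv = max_sofa_in_window(scores, mv_hour - 48, mv_hour)
```

Expressing every exposure as one function with different bounds keeps the primary, sensitivity, ICU, and MV definitions directly comparable.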

We described our cohort using standard summary statistics. Comparisons across SOFA groupings commonly used in CSC policies (SOFA <6, 6-8, 9-11, or ≥12 [3, 4, 13]) were performed using t and chi-square testing. Model discrimination was assessed using the AUROC for SOFA predictions of hospital mortality (inclusive of patients who died during hospitalization or were discharged to hospice). Model calibration was assessed through evaluation of hospital mortality rates across SOFA groupings as well as calibration belts, which visually display type, range, and magnitude of miscalibration (10, 14). As SOFA score is used without adjustment for other covariables as a predictor of short-term mortality in CSC policies, no adjustment for other covariables was included in either analysis.
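Because SOFA is used here as an unadjusted predictor, the AUROC reduces to the probability that a randomly chosen non-survivor has a higher score than a randomly chosen survivor, with ties counted as one half. The sketch below illustrates that definition and the grouped mortality rates; it is not the authors' R code, and the function names are hypothetical.

```python
from collections import defaultdict


def auroc(scores, died):
    """AUROC of a raw score for a binary outcome (ties count as 0.5)."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sofa_group(score):
    """Map a total SOFA score to the CSC groupings used in the text."""
    if score < 6:
        return "<6"
    if score <= 8:
        return "6-8"
    if score <= 11:
        return "9-11"
    return ">=12"


def mortality_by_group(scores, died):
    """Observed hospital mortality rate within each SOFA grouping."""
    counts = defaultdict(lambda: [0, 0])  # group -> [deaths, admissions]
    for s, d in zip(scores, died):
        group = sofa_group(s)
        counts[group][0] += 1 if d else 0
        counts[group][1] += 1
    return {g: deaths / total for g, (deaths, total) in counts.items()}
```

The pairwise loop is quadratic in cohort size, which is fine for illustration; a rank-based formula gives the same value more efficiently for cohorts of tens of thousands of admissions.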

SOFA discrimination and calibration were first calculated for the full cohort. Each was then calculated for patients with and without COVID-19; by hospital protocol, all patients received a severe acute respiratory syndrome coronavirus 2 polymerase chain reaction test. Each was also calculated across racial and ethnic subgroups; all race and ethnicity designations were provided by the patient or their family member and captured in the electronic health record. Categorization of discrimination accuracy used a previously defined framework (AUROC for poor, <0.7; acceptable, 0.7-0.8; excellent, 0.8-0.9; or outstanding, >0.9) (7, 10), and comparisons of discrimination were made by DeLong testing. Likelihood ratio testing was used to assess statistical differences from perfect calibration (10, 14). Comparison of mortality rates across SOFA groupings was made by chi-square testing.
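The accuracy categories can be written as a small lookup. The framework as stated leaves the shared boundaries ambiguous; the sketch below assumes 0.8 belongs to "excellent" (consistent with an AUROC of 0.800 being described as excellent in the results) and 0.9 to "excellent" as well, since "outstanding" is defined as greater than 0.9. The function name is hypothetical.

```python
def categorize_auroc(auc):
    """Label discrimination using the framework cited in the text.

    Assumed boundary handling: <0.7 poor; [0.7, 0.8) acceptable;
    [0.8, 0.9] excellent; >0.9 outstanding.
    """
    if not 0.0 <= auc <= 1.0:
        raise ValueError("AUROC must lie between 0 and 1")
    if auc < 0.7:
        return "poor"
    if auc < 0.8:
        return "acceptable"
    if auc <= 0.9:
        return "excellent"
    return "outstanding"
```

Applied to values reported in this study, `categorize_auroc(0.835)` gives "excellent" and `categorize_auroc(0.948)` gives "outstanding".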

Post hoc, we decided to evaluate the accuracy among COVID-19 admissions of the individual components of the SOFA score: respiratory, cardiovascular, liver, kidney, and coagulation; we did not include neurological as it is less consistently captured and, thus, often assumed to be normal. For those components with at least acceptable accuracy, we investigated whether accuracy or calibration differed by race and ethnicity.

We repeated the above total SOFA score assessments of discrimination and calibration for each secondary cohort (ICU and MV patients). We then conducted sensitivity analyses using the alternate timeframes for maximum SOFA (24 h and 72 h) and, separately, an alternate definition of hospital mortality (reclassifying patients discharged to hospice as survivors). Finally, we evaluated the differential accuracy of SOFA score near the time of critical illness onset for the ICU and MV cohorts across racial and ethnic groups.
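The two mortality definitions differ only in how hospice discharges are classified. A minimal sketch, assuming a discharge disposition field with the hypothetical labels "died", "hospice", and anything else for survivors (these labels are illustrative and not taken from the study's data dictionary):

```python
def hospital_death(disposition, hospice_counts_as_death=True):
    """Classify one admission's outcome under either mortality definition.

    Primary definition: death in hospital or discharge to hospice.
    Sensitivity definition: hospice discharges are treated as survivors.
    """
    if disposition == "died":
        return True
    if disposition == "hospice":
        return hospice_counts_as_death
    return False


dispositions = ["home", "hospice", "died", "rehab"]
primary_outcome = [hospital_death(d) for d in dispositions]
sensitivity_outcome = [hospital_death(d, hospice_counts_as_death=False)
                       for d in dispositions]
```

Keeping the switch as a single parameter makes it straightforward to rerun the same discrimination and calibration analyses under both definitions.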

This study was approved by the Institutional Review Board of the University of Miami (#20200739). P values were considered significant if less than 0.05; correction for multiple comparisons was not used, and, therefore, all secondary analyses should be considered hypothesis generating.

Statistical analyses were performed using R version 3.6.2.

Our primary cohort consisted of 20,045 hospitalizations, of which 1,894 (9.5%) were COVID-19 positive (Figure E1 and Table 1). Across the full cohort, 337 (1.7%) had a maximum SOFA of 9-11 and 210 (1.1%) a maximum SOFA of 12 or more; the remainder had a maximum SOFA lower than 9. There were substantial differences in the maximum SOFA score across COVID-19 positivity (P < 0.001) and racial and ethnic groups (P < 0.001). Admissions with lower maximum SOFA scores were more likely to be COVID-19-positive (9.2% of all patients with a SOFA lower than 6, 13.9% of those with a SOFA of 6-8, and 9.5% of those with a SOFA of 9-11 vs. 6.7% of those with a SOFA of 12 or more were COVID-19-positive) and of non-Hispanic Black race and ethnicity (19.0% of all patients with a SOFA lower than 6 and 21.5% of those with a SOFA of 6-8 vs. 8.9% of those with a SOFA of 9-11 and 14.3% of those with a SOFA of 12 or more were of non-Hispanic Black race and ethnicity).

Maximum SOFA score within 48 hours of hospital admission had excellent accuracy in predicting hospital mortality for the full cohort (AUROC, 0.820) (Figure 1A). SOFA was similarly accurate for COVID-19-positive admissions (AUROC, 0.835) (Figure 1B) and those without COVID-19 infection (AUROC, 0.810; P = 0.15) (Figure 1C).
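Comparing AUROCs between COVID-19-positive and COVID-19-negative admissions involves two independent groups of patients. The sketch below shows one way such an unpaired DeLong-style comparison can be computed: each AUROC's variance is estimated from its structural components, and the difference is referred to a standard normal distribution. It is an illustration under those assumptions, not the authors' R code, and the function names are hypothetical.

```python
import math


def _psi(x, y):
    """Pairwise kernel: 1 if x > y, 0.5 if tied, 0 otherwise."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def auc_with_delong_variance(scores, died):
    """Return (AUROC, DeLong variance estimate) for one group."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    m, n = len(pos), len(neg)
    if m < 2 or n < 2:
        raise ValueError("need at least two deaths and two survivors")
    # Structural components: one per non-survivor and one per survivor.
    v10 = [sum(_psi(p, q) for q in neg) / n for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / m for q in neg]
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    return auc, s10 / m + s01 / n


def unpaired_delong_test(scores_a, died_a, scores_b, died_b):
    """Two-sided z test for a difference in AUROC between independent groups."""
    auc_a, var_a = auc_with_delong_variance(scores_a, died_a)
    auc_b, var_b = auc_with_delong_variance(scores_b, died_b)
    se = math.sqrt(var_a + var_b)
    if se == 0.0:
        raise ValueError("zero variance; test statistic is undefined")
    z = (auc_a - auc_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return auc_a, auc_b, z, p_value
```

Because the two groups share no patients, the covariance term of the paired DeLong test drops out and the variances simply add.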

Increasing SOFA score was associated with increasing mortality rates for the full cohort, from 2.9% for SOFA lower than 6 to 33.8% for SOFA 12 or more (Table 2); similar trends were seen in patients with and without COVID-19. SOFA was poorly calibrated for the full cohort, substantially underestimating mortality for those at low risk and overestimating mortality for those at more moderate risk of death (Figure 1D).

Calibration was good for patients positive for COVID-19 (Figure 1E). For patients without COVID-19 infection, SOFA under- and overestimated mortality in a pattern similar to that for the full cohort (not unexpected, as patients negative for COVID-19 comprised 90.5% of the full cohort) (Figure 1F).

SOFA accuracy among admissions with COVID-19 varied by race and ethnicity, ranging from excellent for non-Hispanic Black people (AUROC, 0.800) to outstanding for Hispanic Black people (AUROC, 0.948) (Figures 2A-2D ). The accuracy of SOFA did not differ statistically between non-Hispanic White admissions (AUROC, 0.894) and those of any other racial and ethnic group: Hispanic White (AUROC, 0.824; P = 0.05); non-Hispanic Black (P = 0.12); or Hispanic Black (P = 0.31). Mortality rates increased as SOFA score increased for each race and ethnic group, and good calibration (albeit with lower confidence for non-Hispanic and Hispanic Black patients) was observed for all races and ethnicities.

SOFA accuracy among admissions without COVID-19 also varied by race and ethnicity but was excellent across groups (AUROC for non-Hispanic White patients, 0.829; Hispanic White patients, 0.811 [P = 0.37]; Hispanic Black patients, 0.828 [P = 0.97]; non-Hispanic Black patients, 0.867 [P = 0.46]). Mortality rates for patients negative for COVID-19 were less consistently correlated with SOFA score; for non-Hispanic patients with higher illness severity (SOFA of 9 or more), increased SOFA score was not coincident with increased mortality, although sample size was small. Calibration was poor for most racial and ethnic groups, with underestimation of mortality for patients without COVID-19 at lower risk and overestimation for those at more moderate risk of death. Calibration was better for Hispanic than for non-Hispanic patients.

Among admissions with COVID-19, the accuracy of the respiratory component of the SOFA score was excellent (AUROC, 0.843), and that for the cardiovascular component was acceptable (AUROC, 0.720); the [...] Black patients (AUROC, 0.939). Compared with non-Hispanic White patients (AUROC, 0.902), accuracy was worse for Hispanic White patients (AUROC, 0.840; P = 0.019) and non-Hispanic Black patients (P = 0.038), but not Hispanic Black patients (P = 0.29) (Figure E3). Accuracy of the cardiovascular component also varied by race and ethnicity but was overall lower, ranging from poor for Hispanic Black patients (AUROC, 0.534) to excellent for non-Hispanic White patients (AUROC, 0.816). Compared with non-Hispanic White patients, accuracy was lower for Hispanic White patients (AUROC, 0.696; P = 0.008) and Hispanic Black patients (P = 0.038), but not for non-Hispanic Black patients (AUROC, 0.712; P = 0.12) (Figure E4). Both respiratory and cardiovascular SOFA components were largely well calibrated (albeit with large confidence intervals) for each racial and ethnic group.

For the 2,862 evaluable ICU admissions (Table E2), SOFA accuracy was acceptable (AUROC, 0.772) (Figure 3A), as it was for the 629 evaluable admissions receiving MV (AUROC, 0.781) (Table E3 and Figure 3B). Accuracy was not significantly different for either cohort based on COVID-19 positivity (Figures 3C-3F); however, in both cohorts, accuracy compared with non-Hispanic White patients (ICU cohort AUROC, 0.817; MV cohort, 0.871) was lower in racial and ethnic minorities (ICU cohort, Hispanic White AUROC, 0.746; P = 0.004; MV cohort, Hispanic White AUROC, 0.769; P = 0.012; non-Hispanic Black AUROC, 0.719; P = 0.014) (Figure E5). SOFA was miscalibrated (both over- and underestimating mortality) for ICU admissions both with and without COVID-19 in patterns similar to those seen for the hospitalized cohort. Calibration for MV patients, independent of COVID-19 status, was better. In a sensitivity analysis of the ICU cohort evaluating maximum SOFA within only the 24 hours prior to admission, SOFA had a slightly improved accuracy (AUROC, 0.813), yet differences were found by COVID-19 status (COVID-19-positive AUROC, 0.695, vs. COVID-19-negative AUROC, 0.815; P = 0.001); better, but still imperfect, calibration was observed (Figure E6).

Our results were robust to the timeframe of maximum SOFA score. For the full cohort, accuracy of maximum SOFA within 24 hours and 72 hours of hospitalization was acceptable, and accuracy was similar for COVID-19-positive and COVID-19-negative admissions (Figures E7 and E8). Miscalibration, which was more pronounced for patients without COVID-19, was observed for both SOFA scores. Reclassifying hospice discharges as survivors led to slightly improved accuracy of the maximum SOFA within 48 hours of hospital admission (our primary exposure, AUROC, full cohort, 0.862; COVID-19-positive, 0.843; and COVID-19-negative, 0.859) and similar calibration (Figure E9).

The accuracy of SOFA in the setting of COVID-19 has been a subject of concern because of its use in CSC resource-allocation policies (2). Our findings of excellent SOFA accuracy (AUROC, 0.835) among COVID-19-positive hospitalizations are consistent with those of Sottile and colleagues using data from Colorado (AUROC, 0.85) (8). Ma and colleagues noted substantially poorer accuracy for COVID-19 hospitalizations (5), yet their cohort was based in China and included patients from early in the pandemic whose treatment and outcomes likely differed from those of patients infected with COVID-19 more recently.

We found SOFA to be of similarly excellent accuracy (AUROC, 0.810) among admissions without COVID-19. Although not previously compared head to head, this accuracy is lower than the outstanding accuracy found by Sottile and colleagues (AUROC, 0.90) in a cohort of hospitalizations inclusive of patients both with and without COVID-19 (8). In this latter study, however, the SOFA scores evaluated were the maximum at any point during hospitalization and not just within the first 48 hours following admission. It is not surprising that SOFA defined in this manner has enhanced accuracy, as it incorporates more data and may include high scores occurring just prior to deaths. Whether predictions based on maximum SOFA defined in this way might differentially impact admissions based on COVID-19 status is unknown.

CSC policies must not create new or enhance existing biases against minority individuals. As such, SOFA accuracy across different racial and ethnic groups is of particular concern. For use in the current pandemic specifically, SOFA scores must be unbiased for both patients with and without COVID-19 as both populations may be "at risk" for triage away from potentially lifesaving resources if supply were limited. To our knowledge, ours is the first study to compare and find similar SOFA accuracy and calibration across racial and ethnic groups of COVID-19-positive hospital admissions. A recent study of patients negative for COVID-19 with sepsis or respiratory failure found that maximum SOFA in the emergency department was more accurate for Black (AUROC, 0.72) versus White (AUROC, 0.67; P < 0.05) patients and was notably differentially miscalibrated, overestimating mortality for Black and underestimating it for White patients (10). These findings contrast with ours of excellent SOFA accuracy for all racial and ethnic subgroups without COVID-19 infection and a qualitatively similar underestimation (for low-risk patients) and overestimation (for more moderate-risk patients) of mortality for both White and Black patients. Although cohort inclusion criteria and exposure definitions differed between the studies, it is not clear whether these factors are sufficient to explain the disparate results. Rather, the discrepancies suggest more study is required to understand how SOFA behaves in our typical hospitalized patient. Also in need of confirmatory study are our novel findings that: among patients with COVID-19, SOFA is as accurate for Black and/or Hispanic individuals as it is for non-Hispanic White patients; among patients without COVID-19, calibration is better for Hispanic than non-Hispanic individuals; and, among all patients irrespective of COVID-19 status, accuracy of SOFA at the time of critical illness onset is lower for racial and ethnic minority patients than it is for non-Hispanic White patients.

The main strength of our study stems from our diverse cohort inclusive of admissions both with and without COVID-19, which 1) mimics the population that would be exposed to a CSC policy for resource allocation; and 2) allows for direct comparison of SOFA predictive value across subgroups. Limitations arise, however, from several areas. First, our cohort is confined to admissions within a single healthcare system in a uniquely diverse region of the United States, potentially limiting generalizability. Specifically, it is not known whether the experiences of Black and/or Hispanic patients in the South Florida area differ from those of similar patients in other parts of the country (e.g., South Florida has an abundance of Spanish-speaking clinicians, which may mitigate some aspects of disparate care). Moreover, COVID-19-related practices (e.g., use of high-flow nasal cannula or noninvasive positive pressure ventilation) likely vary substantially between hospitals. Second, although we never instituted our CSC policy, our hospitals were under strain to varying degrees throughout the period of study; if and how such strain affected care and might confound our results is unknown. Third, although rich in ethnic and Black/White racial diversity, our cohort consisted of few individuals of other racial minorities, making evaluation of SOFA accuracy in these groups impossible. Fourth, differential mortality rates may help explain the differential calibration observed across racial and ethnic groups. However, such differences in mortality will likely exist in the real-world settings in which SOFA may be applied as part of CSC.

Finally, SOFA and critical illnesses are both dynamic. Although our results were robust to different timeframes for defining maximum SOFA after hospitalization, the degrading accuracy of SOFA assessed at the onset of critical illness, when decisions about resource allocation may be required during CSC, is concerning. Moreover, patient subgroups may experience different disease courses (e.g., time to mortality may differ by COVID-19 status [ Figure E10 ]). How consideration of critical illness dynamicity and SOFA trends over the course of illness might impact the predictive value of SOFA is unclear but is of great import if it remains integral to many CSC policies, especially if SOFA trajectory has a differential impact across patient subgroups.

Our findings add to a growing literature showing that SOFA may perform differently in predicting short-term mortality, specifically owing to its variable calibration, across patient subgroups (e.g., by disease type or race and ethnicity). SOFA was developed in 1996 with the express purpose of understanding the "natural history of organ dysfunction" and the "effects of new therapies" and, as noted specifically, "not to predict outcome" (1). In the intervening decades, SOFA has been widely used in research to account for illness severity and, more recently, as the cornerstone of resource allocation for many CSC policies. Prediction tools are best if they are accurate. However, perhaps more importantly, if they are to underpin life-or-death decisions, they must also be precise; similar performance across all patient groups is imperative because real-world resource allocation will never be limited to isolated subgroups but, instead, will be considered for all patients at once. CSC policies aim to ensure fair and equitable resource allocation in times of shortage, yet reliance on an imprecise predictor of short-term mortality may undermine this mission. Whether a single predictor (e.g., SOFA) can achieve this goal or if a tool comprised of different predictors for different subgroups is required remains to be determined.

1. The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-related Problems of the European Society of Intensive Care Medicine

2. Crisis standards of care in the USA: a systematic review and implications for equity amidst COVID-19

3. Assessment of disparities associated with a crisis standards of care resource allocation algorithm for patients in 2 US hospitals during the COVID-19 pandemic

4. Comparison of 2 triage scoring guidelines for allocation of mechanical ventilators

5. Development and validation of a new prognostic scoring system for COVID-19

6. Predictive performance of SOFA and qSOFA for in-hospital mortality in severe novel coronavirus disease

7. Receiver operating characteristic curve in diagnostic test assessment

8. Real-time electronic health record mortality prediction during the COVID-19 pandemic: a prospective cohort study

9. Performance of intensive care unit severity scoring systems across different ethnicities in the USA: a retrospective observational study

10. Equitably allocating resources during crises: racial differences in mortality prediction models

11. National Institutes of Health, National Heart, Lung, and Blood Institute ARDS Network. Comparison of the SpO2/FIO2 ratio and the PaO2/FIO2 ratio in patients with acute lung injury or ARDS

12. Discriminant accuracy of the SOFA score for determining the probable mortality of patients with COVID-19 pneumonia requiring mechanical ventilation

13. Allocation of scarce critical care resources during a public health emergency

14. Assessing the calibration of dichotomous outcome models with the calibration belt

Author disclosures are available with the text of this article at www.atsjournals.org.