key: cord-0857759-w1d6boqc authors: Neto, Felippe Lazar; Marino, Lucas Oliveira; Torres, Antoni; Cilloniz, Catia; Meirelles Marchini, Julio Flavio; Garcia de Alencar, Julio Cesar; Palomeque, Andrea; Albacar, Núria; Brandão Neto, Rodrigo Antônio; Souza, Heraldo Possolo; Ranzani, Otavio T. title: Community-acquired Pneumonia Severity Assessment Tools in Patients Hospitalized with COVID-19: a Validation and Clinical Applicability Study date: 2021-04-02 journal: Clin Microbiol Infect DOI: 10.1016/j.cmi.2021.03.002 sha: 9e4459504eed1da6d02b17114992b161702b1592 doc_id: 857759 cord_uid: w1d6boqc
OBJECTIVE: To externally validate community-acquired pneumonia (CAP) tools in patients hospitalized with COVID-19 pneumonia from two distinct countries, and to compare their performance with that of recently developed COVID-19 mortality risk stratification tools. METHODS: We evaluated 11 risk stratification scores in a binational retrospective cohort of patients hospitalized with COVID-19 pneumonia in São Paulo and Barcelona: Pneumonia Severity Index (PSI), CURB, CURB-65, qSOFA, Infectious Diseases Society of America and American Thoracic Society Minor Criteria, REA-ICU, SCAP, SMART-COP, CALL, COVID GRAM and 4C. The primary and secondary outcomes were 30-day in-hospital mortality and seven-day intensive care unit (ICU) admission, respectively. We compared their predictive performance using the area under the ROC curve (AUROC), sensitivity, specificity, likelihood ratios, calibration plots and decision curve analysis. RESULTS: Among 1363 patients, the mean (SD) age was 61 (16) years. The 30-day in-hospital mortality rate was 24.6% (228/925) in São Paulo and 21.0% (92/438) in Barcelona. For in-hospital mortality, we found higher AUROCs for PSI (0.79, 95%CI 0.77–0.82), 4C (0.78, 95%CI 0.75–0.81), COVID GRAM (0.77, 95%CI 0.75–0.80), and CURB-65 (0.74, 95%CI 0.72–0.77). Results were similar for both countries.
For most of the 1-20% threshold range in decision curve analysis, PSI would avoid a higher number of unnecessary interventions than the other scores, followed by the 4C score. All scores performed poorly (AUROC<0.65) for seven-day ICU admission. CONCLUSIONS: Recent clinical COVID-19 assessment scores had comparable performance to standard pneumonia assessment tools. Because new scores are expected to outperform older ones during development, external validation studies are needed before recommending their use.
The SARS-CoV-2 virus has infected more than 110 million people and killed nearly 2.5 million worldwide [1]. Although most patients have mild, limited symptoms, 15% report dyspnea, and 5% present with hypoxemic respiratory failure, shock, or multiorgan dysfunction [2]. Identifying patients who will need advanced support or are at high risk of poor outcomes is a challenge for physicians. To help decision-making, researchers developed several risk assessment tools specifically for COVID-19; however, most scores had important limitations during development: poor reporting, over-optimism and high risk of bias [3, 4]. In addition, external validation is needed before implementation in routine clinical practice.
Community-acquired pneumonia (CAP) is a common infection and a leading cause of mortality [5, 6]. Over the past decades, risk stratification tools have improved the clinical management of CAP [7]. Unlike COVID-19 prediction rules, CAP scores have been extensively validated [8], and some have already been evaluated in COVID-19 with promising results [9-11].
We evaluated CAP and COVID-19 risk assessment scores in a binational cohort of hospitalized patients with COVID-19 pneumonia in São Paulo and Barcelona during the initial pandemic surge. We hypothesized that CAP prediction rules would perform similarly to recently developed COVID-19 ones.
Journal Pre-proof
We retrospectively analyzed patients with COVID-19 pneumonia admitted to the emergency department (ED) of two university hospitals: Hospital das Clínicas (from March 14th to June 14th) and Hospital Clinic (from February 28th to May 5th). Both hospitals were designated tertiary referral centers for suspected COVID-19 cases in their respective cities, São Paulo (Brazil) and Barcelona (Spain). Both ethics committees approved the study protocols (CAAE 30417520.0.0000.0068 and Register HCB/2020/0273). We defined COVID-19 pneumonia as a new infection-compatible infiltrate on lung CT or chest x-ray associated with symptoms of acute lower respiratory tract infection. All patients were admitted and treated according to the institutional protocol. A real-time reverse transcriptase polymerase chain reaction (RT-qPCR) test was performed on upper (nasopharyngeal or oropharyngeal) or lower (endotracheal) respiratory specimens to confirm SARS-CoV-2 infection.
We applied the following risk assessment scores according to admission variables: Pneumonia Severity Index (PSI) [12], CURB [13], CURB-65 [13], IDSA/ATS Minor Criteria [14], quick Sepsis Related Organ Failure Assessment (qSOFA) [15, 16], Severe Community Acquired Pneumonia (SCAP) [17], SMART-COP [18], the Risk of Early Admission to ICU index (REA-ICU) [19], COVID-GRAM [20], CALL [21] and 4C [22]. We used their original descriptions (Supplemental Appendix 2). Cut-off values for each score were taken from the development report when available, or otherwise from standard use. A 10% risk threshold was selected for COVID-GRAM based on similar risk prediction tools [18, 19]. When deriving scores that included hypoxemia, we considered the need for supplemental oxygen therapy or a peripheral oxygen saturation <92% equivalent to documented laboratory hypoxemia (pO2<60 mmHg).
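As an illustration of how an admission-based score and the hypoxemia surrogate rule could be derived, the following is a minimal Python sketch, not the authors' code (their analysis was done in R). The `Admission` record and its field names are hypothetical. The CURB-65 items follow the standard published criteria (confusion, urea >7 mmol/L, respiratory rate ≥30/min, systolic blood pressure <90 mmHg or diastolic ≤60 mmHg, age ≥65 years), which are assumed here because the exact definitions are in the supplement rather than in the text. Treating an unavailable item as zero points mirrors the handling of variables absent from the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Admission:
    """Hypothetical admission record; field names are illustrative only."""
    age: int
    confusion: Optional[bool]
    urea_mmol_l: Optional[float]
    resp_rate: Optional[int]
    sbp: Optional[int]
    dbp: Optional[int]
    spo2: Optional[float] = None
    on_oxygen: Optional[bool] = None
    po2_mmhg: Optional[float] = None


def hypoxemia(p: Admission) -> bool:
    """Surrogate rule from the text: supplemental oxygen or SpO2 < 92%
    counts as documented laboratory hypoxemia (pO2 < 60 mmHg)."""
    if p.po2_mmhg is not None and p.po2_mmhg < 60:
        return True
    if p.on_oxygen:
        return True
    return p.spo2 is not None and p.spo2 < 92


def curb65(p: Admission) -> int:
    """CURB-65 with the standard published criteria (an assumption here).
    An unavailable item (None) contributes zero points."""
    items = [
        bool(p.confusion),
        p.urea_mmol_l is not None and p.urea_mmol_l > 7,
        p.resp_rate is not None and p.resp_rate >= 30,
        (p.sbp is not None and p.sbp < 90)
        or (p.dbp is not None and p.dbp <= 60),
        p.age >= 65,
    ]
    return sum(items)


# Example patient (invented values, not study data)
patient = Admission(age=71, confusion=False, urea_mmol_l=8.2,
                    resp_rate=32, sbp=110, dbp=70, spo2=90.0)
print(curb65(patient), hypoxemia(patient))
```

Keeping the surrogate rule in one helper means every score that includes a hypoxemia item applies the same definition.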
Variables that were not in the database, and consequently could not be imputed, were assigned zero for risk calculation and are specified in Supplemental Appendix 2.
Our primary outcome was in-hospital mortality at 30 days. Patients still hospitalized at 30 days were considered alive. The secondary outcome was admission to the intensive care unit (ICU) within seven days (excluding patients on mechanical ventilation or vasoactive drugs before hospital admission).
Mean, standard deviation (SD), median and interquartile range (IQR) were used for descriptive statistics according to variable distribution. We defined the statistical analysis plan a priori. We expected a large proportion of missing values because of the number of risk scores tested and the wide range of variables considered by each score. We performed a single imputation procedure with chained equations, assuming a missing at random (MAR) pattern, in which the probability that a value is missing depends only on measured variables. We used predictive mean matching (PMM) because of its flexibility for imputing different types of variables [23]. Outcome and country were included as predictors during the imputation process. Supplemental Table 1 reports the percentage of missing values for each variable.
Model predictive performance was assessed with the area under the receiver operating characteristic curve (AUROC) and the Brier score. The Brier score is an overall model fit metric that combines discrimination and calibration; values closer to 0 are better (0 corresponds to a "perfect model"). Calibration was evaluated using calibration plots divided into quintiles of predicted probabilities. Clinical utility was analyzed using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), and negative likelihood ratio (NLR). Confidence intervals (95%) were calculated from 1000 bootstrap re-samples.
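The performance measures listed above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation under stated assumptions, not the authors' R code: the function names are hypothetical, the AUROC uses the rank (Mann-Whitney) formulation, and the confidence interval uses the percentile bootstrap, which the text does not specify.

```python
import random
from typing import Dict, Sequence, Tuple


def auroc(y: Sequence[int], score: Sequence[float]) -> float:
    """Probability that a random patient with the outcome scores higher
    than a random patient without it; ties count one half."""
    pos = [s for s, o in zip(score, y) if o == 1]
    neg = [s for s, o in zip(score, y) if o == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y: Sequence[int], prob: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and outcome."""
    return sum((p - o) ** 2 for p, o in zip(prob, y)) / len(y)


def cutoff_metrics(y: Sequence[int], score: Sequence[float],
                   cutoff: float) -> Dict[str, float]:
    """2x2-table metrics for the rule 'score >= cutoff'.
    Assumes no denominator below is zero."""
    tp = sum(1 for o, s in zip(y, score) if s >= cutoff and o == 1)
    fp = sum(1 for o, s in zip(y, score) if s >= cutoff and o == 0)
    fn = sum(1 for o, s in zip(y, score) if s < cutoff and o == 1)
    tn = sum(1 for o, s in zip(y, score) if s < cutoff and o == 0)
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return {"sensitivity": sens, "specificity": spec,
            "ppv": tp / (tp + fp), "npv": tn / (tn + fn),
            "plr": sens / (1 - spec), "nlr": (1 - sens) / spec}


def bootstrap_ci(y: Sequence[int], score: Sequence[float],
                 n_boot: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile 95% CI for the AUROC (percentile method is an assumption)."""
    rng = random.Random(seed)
    n = len(y)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both outcomes
            stats.append(auroc(ys, [score[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Toy data (invented, not study data): outcome and an integer score
y = [0, 0, 0, 1, 0, 1, 1, 1]
score = [0, 1, 1, 1, 2, 2, 3, 4]
print(auroc(y, score), cutoff_metrics(y, score, cutoff=2))
```

The Brier score needs predicted probabilities rather than raw points, so for point-based scores a risk would first have to be assigned to each score value.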
To incorporate clinical decision reasoning into model evaluation, we used the decision curve analysis framework [24], in which predictive models are compared with the default strategies of treating all or none of the patients. For each strategy, we calculated the net benefit (NB) by subtracting the proportion of false positives (FP), weighted by the relative harm of a false-positive compared with a false-negative result, from the proportion of true positives (TP). In short, we take into account how many false-positive patients the physician is willing to treat to avoid leaving one true-positive patient untreated [24, 25]. The net benefit of treated patients (NBt) is the result of subtracting the NB of the treat-all strategy from the NB of the evaluated model. This number is then used to compute the number of avoidable interventions (NAI) per 100 patients. In this study, the intervention would be the optimization of hospital resources (intensity of care), which is the ultimate decision goal when applying mortality risk stratification tools in the emergency department. We restricted probability thresholds to between 0.01 and 0.2 (99:1 and 4:1 false-positive/false-negative weights, respectively), as is commonly done for infectious diseases, including pneumonia. The decision curve analysis and calibration were restricted to in-hospital mortality.
We followed the Transparent Reporting of a Multivariable Prediction for Individual Prognosis or Diagnosis (TRIPOD) framework [26]. All statistical analyses were performed in R version 3.6.2.
The São Paulo cohort had more patients above the cutoffs for all scores. For all prediction assessment tools, a point increase was followed by an increase in the observed mortality rate (Figure 1). Overall performance is shown in Table 2 Table 3 ). Table 5 and 6) shows PSI with the highest AUROC in both; however, most scores performed better in Barcelona than in São Paulo. Overall, calibration was good (Supplemental Figure 2 and 3).
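The net-benefit calculation described in the Methods can be made concrete with a short Python sketch. It follows the standard decision curve formulation (net benefit = TP/n − FP/n × pt/(1 − pt), and net reduction in interventions per 100 patients = (NB of the model − NB of treat-all) / (pt/(1 − pt)) × 100); that this is exactly how NBt and NAI were computed in the study is an assumption, and all function names are hypothetical.

```python
from typing import Sequence


def threshold_odds(pt: float) -> float:
    """Weight of a false positive relative to a true positive at threshold pt."""
    return pt / (1 - pt)


def net_benefit(y: Sequence[int], flagged: Sequence[bool], pt: float) -> float:
    """Net benefit of treating only the flagged (score-positive) patients."""
    n = len(y)
    tp = sum(1 for o, f in zip(y, flagged) if f and o == 1)
    fp = sum(1 for o, f in zip(y, flagged) if f and o == 0)
    return tp / n - fp / n * threshold_odds(pt)


def net_benefit_treat_all(y: Sequence[int], pt: float) -> float:
    """Net benefit when every patient is treated."""
    prev = sum(y) / len(y)
    return prev - (1 - prev) * threshold_odds(pt)


def avoided_per_100(y: Sequence[int], flagged: Sequence[bool],
                    pt: float) -> float:
    """Interventions avoided per 100 patients compared with treat-all,
    without missing any additional true positives on net."""
    diff = net_benefit(y, flagged, pt) - net_benefit_treat_all(y, pt)
    return diff / threshold_odds(pt) * 100


# Invented example: 100 patients, 20 deaths; the score flags 18 deaths
# and 40 survivors, and leaves 2 deaths and 40 survivors unflagged.
y = [1] * 18 + [1] * 2 + [0] * 40 + [0] * 40
flagged = [True] * 18 + [False] * 2 + [True] * 40 + [False] * 40
print(round(avoided_per_100(y, flagged, pt=0.10), 1))  # 40 - 2 * 9 = 22.0
```

At the study's threshold limits, (1 − pt)/pt equals 99 at pt = 0.01 and 4 at pt = 0.2, matching the 99:1 and 4:1 weights quoted above. The NWT values reported with the results (20 at a 5% threshold, 10 at a 10% threshold) are consistent with 1/pt.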
Higher sensitivities were found for 4C, CALL and CURB, and higher specificities for qSOFA and CURB-65. All scores had poor AUROC (Table 2) Table 7 and 8). 4C and CALL had higher sensitivities while qSOFA had the highest specificity.
Figure 2 shows the decision curve analysis for in-hospital mortality. PSI had the best net benefit for most tested thresholds (1-20%), followed by the 4C score. At a probability threshold of 5% (NWT of 20), PSI is the best strategy, as it would avoid 6.2 interventions per 100 screened patients (Table 3). As the probability threshold increases, the best strategies change: at a 10% threshold (NWT 10), PSI, 4C and CURB would avoid 15.8, 14.6 and 11.9 interventions per 100 patients, respectively; at a 20% threshold, SCAP would avoid 28.3, 4C 28.0 and PSI 27.9 interventions.
We observed that the predictive performance of classical CAP severity scores is comparable to that of recently developed COVID-19 scores in 1363 hospitalized patients with COVID-19 pneumonia in Brazil and Spain. PSI and the recent 4C score had comparable performance in all evaluations. Among the tested scores, results were consistent for both cohorts, which are expected to have significant unmeasured differences in treatment and in risk factors for poor outcomes because of socio-economic discrepancies in the underlying populations (an upper-middle-income country versus a high-income country).
PSI had the highest performance compared with the other prediction rules regardless of country of origin, and our results are comparable to those found in similar pandemic scenarios [9, 10]. A possible explanation is that PSI heavily weights comorbidities and age, which are known strong independent mortality risk factors for COVID-19 [27]. The same reasoning applies to the 4C score. With both scores, a 71-year-old man is classified as intermediate risk based solely on age and gender, regardless of any other information.
The methodological rigor of PSI development, which included a large sample size, helped build a robust model that has been extensively validated in the literature [8, 28, 29].
The use of risk stratification scores in clinical practice requires analyzing the decision curve. PSI would be the best strategy in our cohort for threshold probabilities ≤5%. However, such a low threshold would only be reasonable in a scenario with little risk of overcrowding, a context that does not apply to many countries during this pandemic. Moreover, because there is no specific treatment yet known for COVID-19, a higher intensity of care for patients not at high risk of death may increase nosocomial infections and other related complications without necessarily decreasing mortality. For thresholds between 6% and 20%, PSI and 4C had the highest net benefit throughout the range. Although hospitalization is often unavoidable even with a low predicted mortality risk (e.g. need for oxygen therapy), these instruments can help manage limited hospital resources by suggesting referral to higher- or lower-complexity facilities.
Deciding which assessment tool to apply involves not only the decision curve but also previous validations, generalizability, test availability and estimation complexity. The large number of variables required to calculate PSI can make it time-consuming and unrealistic in under-resourced or overwhelmed scenarios. By contrast, qSOFA, a simpler tool that relies on only three clinical variables and is consequently widely applicable, had poor overall performance and an unexpectedly low sensitivity and high specificity for a screening tool, findings in line with similar studies in CAP [16]. Nevertheless, qSOFA had the highest positive predictive value, which still makes it a risk-stratification tool worth further evaluation.
Alternatives with reasonable performance, fewer required tests and easy estimation, potentially applicable to low-resource settings, are CURB-65 (mixed clinical variables and urea) and the 4C score (mixed clinical variables, urea and C-reactive protein). CURB-65 has some advantages over 4C, as it has been extensively validated in different scenarios [8] and is already part of routine risk assessment for CAP in many emergency departments.
Remarkably, none of the evaluated scores performed well for seven-day ICU admission. Scores that were developed for this particular outcome, such as SMART-COP, SCAP and REA-ICU, showed better overall performance in our cohort. SCAP and SMART-COP had the highest sensitivities among the three (81% and 80%, respectively) and a better ability to exclude the outcome when negative (NPV≥75%). However, most patients were over the threshold at admission and therefore still at reasonable risk of ICU admission: 73 and 77 out of 100 admitted patients had SMART-COP≥3 and SCAP≥10, respectively. Although both 4C≥4 and CALL≥6 had high sensitivities and NPV, they included over 92% of admitted patients, making them less useful. One possible explanation for the under-performance of CAP scores is that they rely on imaging findings (unilateral, multi-lobar or bilateral) that are known to affect prognosis in CAP [30] but whose prognostic value is still unknown in COVID-19. Overall, these results coincide with those found for mortality in that new scores for COVID-19 performed similarly to CAP ones.
Our study has limitations. First, our results may not apply to secondary or primary settings, as both medical centers were tertiary. Second, because we did not evaluate outpatients, the current external validation cannot support these scores for COVID-19 triage, and future studies are needed to clarify their clinical applicability in this setting.
Additionally, our study covers the initial pandemic surge in both countries, which was subject to the learning curve of COVID-19 treatment and to high demand on the health system, limiting our conclusions to similar scenarios. Third, we included patients with a clinical COVID-19 diagnosis in Brazil because of the RT-qPCR shortage early in the pandemic. However, our sensitivity analysis including only RT-qPCR-confirmed cases showed comparable predictive performance. Finally, it is a challenge to apply risk stratification tools in tertiary referral settings, as previous treatments may lead to underscoring at admission (e.g. use of anti-pyretic medications and temperature at admission).
Despite these limitations, the present study provides a validation of several scores already described for CAP that could help physicians address patients' safety and manage hospital resources. Among its strengths, our study has shown consistent validation results for cohorts from two countries with distinct socio-economic, ethnic, and demographic backgrounds.
In summary, the performance of standard CAP risk assessment scores was comparable to that of three recently developed COVID-19 mortality risk stratification tools. New scores are expected to outperform older ones during development because they are often trained and tested on similar datasets. Therefore, more external validation studies are needed to ensure generalizability before recommending their use.
Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72 314 Cases From the Chinese Center for Disease Control and Prevention
Prediction models for diagnosis and prognosis in Covid-19
Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
Hospital volume and 30-day mortality for three common medical conditions
Diagnosis and Treatment of Adults with Community-acquired Pneumonia. An Official Clinical Practice Guideline of the
Severity assessment tools for predicting mortality in hospitalised patients with community-acquired pneumonia. Systematic review and meta-analysis
Comparison of severity scores for COVID-19 patients with pneumonia: a retrospective study
Performance of pneumonia severity index and CURB-65 in predicting 30-day mortality in patients with COVID-19
Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: An observational cohort study
A prediction rule to identify low-risk patients with community-acquired pneumonia
Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study
Validation and clinical implications of the IDSA/ATS minor criteria for severe community-acquired pneumonia
Assessment of Clinical Criteria for Sepsis: For the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3)
New Sepsis Definition (Sepsis-3) and Community-acquired Pneumonia Mortality. A Validation and Clinical Decision-Making Study
Development and validation of a clinical prediction rule for severe community-acquired pneumonia
SMART-COP: a tool for predicting the need for intensive respiratory or vasopressor support in community-acquired pneumonia
Risk stratification of early admission to the intensive care unit of patients with no major criteria of severe community-acquired pneumonia: development of an international prediction rule
Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19
Prediction for Progression Risk in Patients with COVID-19 Pneumonia: the CALL Score
Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score
Clinical Prediction Models
Decision curve analysis: a novel method for evaluating prediction models
A simple, step-by-step guide to interpreting decision curve analysis
Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet
Calculating the sample size required for developing a clinical prediction model
The pneumonia severity index: a decade after the initial derivation and validation
Multilobar bilateral and unilateral chest radiograph involvement: implications for prognosis in hospitalised community-acquired pneumonia