key: cord-0919828-5m4760l9
authors: Soto-Mota, A.; Marfil-Garza, B. A.; Castiello, S.; Martinez-Rodriguez, E.; Carrillo-Vazquez, D.; Tadeo Espinoza, H.; Guerrero-Cabrera, J. P.; Dardon-Fierro, F. E.; Escobar Valderrama, J. M.; Alanis-Mendizabal, J.; Gutierrez-Mejia, J.
title: Prospective predictive performance comparison between Clinical Gestalt and validated COVID-19 mortality scores.
date: 2021-04-18
journal: nan
DOI: 10.1101/2021.04.16.21255647
sha: d6c18f00bc87eb259919756a2e5decfdb33d698b
doc_id: 919828
cord_uid: 5m4760l9

ABSTRACT Background: Most COVID-19 mortality scores were developed in the early months of the pandemic, and the evidence-based interventions now available have helped reduce its lethality. It has not been evaluated whether the original predictive performance of these scores still holds, nor has it been compared against Clinical Gestalt predictions. We tested the current predictive accuracy of six COVID-19 scores and compared it with Clinical Gestalt predictions. Methods: 200 COVID-19 patients were enrolled in a tertiary hospital in Mexico City between September and December 2020. Clinical Gestalt predictions of death (as a percentage) and the LOW-HARM, qSOFA, MSL-COVID-19, NUTRI-CoV and NEWS2 scores were obtained at admission. We calculated the AUC of each score and compared it against that of Clinical Gestalt predictions and against its originally reported value. Results: 106 men and 60 women aged 56+/-9 years with confirmed COVID-19 were included in the analysis. The observed AUC of all scores was significantly lower than originally reported (originally reported vs observed): LOW-HARM 0.96 (0.94-0.98) vs 0.76 (0.69-0.84), qSOFA 0.74 (0.65-0.81) vs 0.61 (0.53-0.69), MSL-COVID-19 0.72 (0.69-0.75) vs 0.64 (0.55-0.73), NUTRI-CoV 0.79 (0.76-0.82) vs 0.60 (0.51-0.69), NEWS2 0.84 (0.79-0.90) vs 0.65 (0.56-0.75), Neutrophil-Lymphocyte ratio 0.74 (0.62-0.85) vs 0.65 (0.57-0.73). Clinical Gestalt predictions were non-inferior to mortality scores (AUC=0.68 (0.59-0.77)). Adjusting the LOW-HARM score with locally derived likelihood ratios did not improve its performance. However, some scores performed better than Clinical Gestalt predictions when the clinician's confidence in the prediction was <80%. Conclusion: No score was significantly better than Clinical Gestalt predictions. Despite its subjective nature, Clinical Gestalt has relevant advantages for predicting COVID-19 clinical outcomes.

medRxiv preprint doi: https://doi.org/10.1101/2021.04.16.21255647; this version posted April 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Many prediction models have been developed for COVID-19 (1) (2) (3) (4) (5) and their applications in healthcare range from bedside counseling to triage systems (6). However, most have been developed within specific clinical contexts (1, 2) or validated with data from the early months of the pandemic (4, 5). Since then, health systems have implemented protocols and adaptations to cope with a surge in hospitalization rates (7), and clinicians now have more knowledge and experience in managing these patients. Additionally, non-biological factors such as critical-care availability have been found to strongly influence the prognosis of COVID-19 patients (9). These frequently intangible factors (e.g., the experience of the staff with specific healthcare tasks) impact prognosis but are ignored by mortality scores.

Prediction models are context-sensitive (10). Therefore, to preserve their accuracy, they must be applied in contexts as similar as possible to the ones in which they were derived.

Considering that healthcare systems and settings are quite different around the world, there are many examples of scores requiring adjustments or local adaptations (11, 12) .

Predicting is an everyday activity in most medical fields and, in other scenarios, clinicians' subjective predictions have been observed to be as accurate as mathematically derived models (13) (14) (15). However, the opposite has been observed as well; for example, clinicians tend to overestimate the long-term survival of oncologic patients (16). This work aimed to compare the predictive performance of different mortality prediction models for COVID-19 (some of them in the same hospital where they were developed) against their original performance and against Clinical Gestalt predictions.

To test the hypothesis that the predictive performance of already validated scores declined over time, we chose the LOW-HARM (4), MSL-COVID-19, and NUTRI-CoV (5) scores because all three were validated with data from Mexican COVID-19 patients. To rule out that this was a phenomenon exclusive to scores developed with Mexican data, we re-evaluated the accuracy of the NEWS2 (1) and qSOFA (2) scores, and of the Neutrophil:Lymphocyte ratio, to predict mortality from COVID-19 (17).

Clinical Gestalt predictions and all data needed to calculate the prognostic scores were obtained at hospital admission from October to December 2020. The Internal Medicine residents in charge of collecting the clinical history, physical examination, and initial imaging and laboratory workup were asked the following questions once all the initial imaging and laboratory reports were available:

1) How likely do you think it is this patient will die from COVID-19? (as a percentage).

2) How confident are you of that prediction? (as a percentage).

Additionally, to test the hypothesis that updating the statistical weights of a score with local data could help preserve its original accuracy, we developed a second version of the LOW-HARM score (LOW-HARM score v2) using positive and negative likelihood ratios derived from cohorts of Mexican patients (4,8) (instead of only positive likelihood ratios from Chinese patients (18, 19) as in the original version).
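
To illustrate how likelihood ratios can re-weight a score, the following Python sketch applies the standard odds form of Bayes' theorem: a pre-test probability is converted to odds, multiplied by each likelihood ratio, and converted back to a probability. It is only a sketch under stated assumptions. This text does not reproduce the LOW-HARM formula, so the function name `post_test_probability`, the pre-test probability, and every likelihood ratio value below are hypothetical and chosen for arithmetic convenience, not taken from any cohort.

```python
def post_test_probability(pretest_prob, likelihood_ratios):
    """Update a pre-test probability with a sequence of likelihood ratios.

    Uses the odds form of Bayes' theorem: post-test odds equal the
    pre-test odds multiplied by each likelihood ratio in turn.
    """
    if not 0.0 < pretest_prob < 1.0:
        raise ValueError("pretest_prob must be strictly between 0 and 1")
    odds = pretest_prob / (1.0 - pretest_prob)
    for lr in likelihood_ratios:
        if lr <= 0:
            raise ValueError("likelihood ratios must be positive")
        odds *= lr
    return odds / (1.0 + odds)


# Hypothetical values for illustration only (not from the study).
pretest = 0.20

# Positive likelihood ratios only: each finding that is present raises the odds.
p_positive_only = post_test_probability(pretest, [2.0, 3.0])

# Positive and negative likelihood ratios: an absent finding lowers the odds.
p_with_negative = post_test_probability(pretest, [2.0, 3.0, 0.5])

print(round(p_positive_only, 3), round(p_with_negative, 3))
```

With these made-up inputs, including a negative likelihood ratio for an absent finding lowers the predicted probability, which is the kind of change that using both positive and negative likelihood ratios introduces.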

Finally, to test the hypothesis that scores outperformed Clinical Gestalt predictions when clinicians' confidence was "low" (below or equal to the median perceived confidence, i.e., <80%), we conducted a comparative AUC analysis of cases below or above this threshold.
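
The stratified comparison can be sketched in Python as follows. This is a minimal illustration, not the analysis code of this study: the record fields (`gestalt`, `score`, `confidence`, `died`), the helper names, and the toy data are all hypothetical, and cases with a confidence of exactly 80% are assigned to the "high" group here as an assumption. The AUC is computed as the proportion of death-survivor pairs in which the patient who died received the higher prediction, with ties counted as one half.

```python
def auc(scores, outcomes):
    """AUC as the fraction of (death, survivor) pairs ranked correctly.

    outcomes: 1 = died, 0 = survived. Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_confidence(records, threshold=80.0):
    """Compute the AUC of each predictor separately below and at/above the threshold."""
    groups = {
        "low": [r for r in records if r["confidence"] < threshold],
        "high": [r for r in records if r["confidence"] >= threshold],
    }
    result = {}
    for label, group in groups.items():
        died = [r["died"] for r in group]
        result[label] = {
            "gestalt": auc([r["gestalt"] for r in group], died),
            "score": auc([r["score"] for r in group], died),
        }
    return result


# Toy records, invented for illustration only.
records = [
    {"gestalt": 30, "score": 8, "confidence": 50, "died": 1},
    {"gestalt": 40, "score": 3, "confidence": 60, "died": 0},
    {"gestalt": 20, "score": 9, "confidence": 70, "died": 1},
    {"gestalt": 10, "score": 2, "confidence": 40, "died": 0},
    {"gestalt": 90, "score": 7, "confidence": 90, "died": 1},
    {"gestalt": 5, "score": 6, "confidence": 95, "died": 0},
    {"gestalt": 80, "score": 4, "confidence": 85, "died": 1},
    {"gestalt": 10, "score": 5, "confidence": 80, "died": 0},
]

result = auc_by_confidence(records)
print(result)
```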

We calculated with "easyROC" (20), an open R-based web tool for estimating sample sizes for direct and non-inferiority AUC comparisons using Obuchowski's method (21), that for detecting non-inferiority with a >0.05 maximal AUC difference with the reported LOW-

Setting: A tertiary hospital in Mexico City, fully dedicated to COVID-19 healthcare between October and December 2020.

Data from 200 consecutive hospital admissions (with an RT-PCR confirmed COVID-19 infection) were obtained between October and December of 2020. We excluded from the analysis all patients without a documented clinical outcome (e.g., hospitalized at the moment of data collection, transferred to another hospital, voluntary discharge). A total of 166 patients were included in the analysis because 34 patients were either transferred to other hospitals or voluntarily discharged.

Clinical and demographic data were analysed using mean or median (depending on their distribution) and standard deviation or interquartile range (IQR) as dispersion measures.

Shapiro-Wilk tests were used to assess whether variables were normally distributed. Statistical analysis was performed with R version 4.0.3 (using the packages "caret" for confusion matrix calculations and "pROC" for ROC analysis) and STATA v12. Differences between AUCs were analysed using DeLong's method with the STATA function "roccomp" (22). A p value of <0.05 was used for inferring statistical significance in all statistical tests.
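
For readers who want to reproduce a paired AUC comparison without STATA, the following Python sketch may be useful. It is not the method used in this study: instead of DeLong's method, it swaps in a paired percentile bootstrap, which is a different technique that addresses the same question (whether two predictors evaluated on the same patients differ in AUC). The function names, the number of resamples, the seed, and the toy data are assumptions made for illustration.

```python
import random


def auc(scores, outcomes):
    """AUC as the fraction of (death, survivor) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(score_a, score_b, outcomes, n_boot=2000, seed=0):
    """Observed AUC difference (a minus b) with a 95% percentile bootstrap interval.

    Patients are resampled with replacement, keeping both predictions and the
    outcome of each patient together so that the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    observed = auc(score_a, outcomes) - auc(score_b, outcomes)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [outcomes[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that contain only one outcome class
        a = [score_a[i] for i in idx]
        b = [score_b[i] for i in idx]
        diffs.append(auc(a, y) - auc(b, y))
    diffs.sort()
    lower = diffs[(25 * n_boot) // 1000]
    upper = diffs[(975 * n_boot) // 1000 - 1]
    return observed, lower, upper


# Toy data, invented for illustration only.
outcomes = [1, 1, 1, 0, 0, 0, 0, 0]
gestalt = [80, 60, 20, 30, 10, 5, 15, 25]
score = [9, 3, 7, 5, 4, 2, 6, 1]

observed, lower, upper = bootstrap_auc_difference(gestalt, score, outcomes)
print(observed, lower, upper)
```

If the interval excludes zero, the two AUCs would be judged different at roughly the 5% level. With a sample as small as this toy example the interval is expected to be wide.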

We included 166 patients in our study. Of these, 47 (28.3%) died and 119 (71.7%) survived. General demographics and clinical characteristics of these populations are shown in Table 1. As expected, decreased peripheral saturation, ventilatory support, cardiac injury, renal injury, leukocytosis, and lymphocytosis were more prevalent in the group of patients that died during their hospitalization. Table 2 shows the median scores and their IQR for each prediction tool. As expected, there was a more pronounced mean difference between groups in scores that were based on a 100-point scale (Clinical Gestalt, LOW-HARM scores). Table 2 shows the originally reported AUC vs the AUC we observed in our data. Figure 1 shows the performance characteristics of the selected predictive models. Overall, we found a statistically significant difference between predictive models (p=0.002).

However, we did not find statistically significant differences between Clinical Gestalt and other prediction tools.

As expected, we found that the confidence of prediction increased in cases in which the predicted probability of death was clearly high or clearly low (Figure 2). We found a moderate-to-strong, bimodal correlation between the confidence of prediction and the predicted probability of death, both at a <50% predicted probability of death (Pearson's R=0.60, p<0.0001) and at a >50% predicted probability of death (Pearson's R=0.50, p=0.0002).
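
A split correlation of this kind can be sketched in Python as shown below. This is an illustration under assumptions, not the code used for this analysis: the function names and the toy data are hypothetical, and the text does not state how the variables were coded, so the sign of the coefficient in the sketch need not match the reported values. In the toy data, confidence falls as the predicted probability approaches 50% from below, which gives a negative coefficient in that range.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def split_correlations(predicted, confidence, cutoff=50.0):
    """Correlate confidence with predicted probability below and above a cutoff.

    Cases exactly at the cutoff are left out of both groups.
    """
    below = [(p, c) for p, c in zip(predicted, confidence) if p < cutoff]
    above = [(p, c) for p, c in zip(predicted, confidence) if p > cutoff]
    r_below = pearson_r([p for p, _ in below], [c for _, c in below])
    r_above = pearson_r([p for p, _ in above], [c for _, c in above])
    return r_below, r_above


# Toy data, invented for illustration only (perfectly linear on purpose).
predicted = [10, 20, 30, 40, 60, 70, 80, 90]
confidence = [90, 80, 70, 60, 60, 70, 80, 90]

r_below, r_above = split_correlations(predicted, confidence)
print(r_below, r_above)
```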

We further explored the performance characteristics of the selected predictive models in specific contexts (Appendix Table 1). Figure 3 shows the results of the analysis including cases in which the certainty of prediction was below and above 80%. Overall, we found a statistically significant difference between predictive models in both settings. In cases in which the confidence of prediction was <80%, both versions of the LOW-HARM score showed a larger AUC than Clinical Gestalt (Figure 3b and Appendix Table 1).

An additional analysis restricted to cases in which the certainty of prediction was <80% and the predicted probability of death was <30% (i.e., the median value for all cases) found a statistically significant difference between predictive models (p=0.0005). Similarly, individual comparisons showed statistically significant differences in AUC between Clinical Gestalt and both versions of the LOW-HARM score (Appendix Table 1).

Additionally, we explored the accuracy of Clinical Gestalt across different degrees of prediction confidence. To our knowledge, this is the first time that this type of analysis has been done for subjective clinical predictions, and it proved to be quite insightful. The fact that Clinical Gestalt's accuracy correlates with confidence in prediction suggests that, while there is value in subjective predictions, it is also important to ask ourselves how confident we are about our predictions. Interestingly, our results suggest that Clinical Gestalt predictions are particularly prone to positive bias: clinicians were more likely to correctly predict which patients would survive than which patients would die (Figure 2 and Supplementary Figure 1). This is consistent with other studies that have found that clinicians tend to overestimate the effectiveness of their treatments and, therefore, patient survival (16).

Scores are expected to lose at least some of their predictive accuracy when used outside the context in which they were developed, and local adaptations have been reported to improve or help retain their predictive performance. In this work, we evaluated whether updating the likelihood ratio values used in the calculation of the LOW-HARM score with data from Mexican patients could mitigate its loss of accuracy.

However, although the AUC of the LOW-HARM score v2 was slightly larger than that of the original LOW-HARM score, the difference was not statistically significant, nor was the LOW-HARM score v2 significantly more accurate than Clinical Gestalt predictions. This highlights the fact that scores are far from being final or perfect tools even after implementing local adjustments.

Although some of the results in this study may prove insightful for other clinical settings and challenges, they cannot be widely extrapolated due to the local setting of our work and the highly heterogeneous nature of COVID-19 healthcare systems. Additionally, it is likely that emerging variants, vaccination, or the seasonality of the contagion waves (23) will continue to influence the predictive capabilities of all predictive models. Finally, our sample size was calculated to detect non-inferiority between prediction methods.

Specifically designed studies are needed to better investigate the relationship between subjective confidence, accuracy, and positive bias.

Clinical predictions will always be challenging because all medical fields are in constant development and clinical challenges are highly dynamic phenomena. Despite its inherent subjectivity, Clinical Gestalt immediately incorporates context-specific factors and, in contrast with statistically derived models, is likely to improve its accuracy over time.

All authors contributed significantly to the design, analysis, and reporting of this study. Dr Adrian Soto-Mota is the guarantor of this study and takes responsibility for the contents of this article.

Innovation Center. All authors wish to thank the invaluable support of the National Institute of Medical Sciences and Nutrition Salvador Zubirán Emergency Department staff.

NEWS2 is a valuable tool for appropriate clinical management of COVID-19 patients

Predictive performance of SOFA and qSOFA for in-hospital mortality in severe novel coronavirus disease

Neutrophil-to-lymphocyte ratio as a predictive biomarker for moderate-severe ARDS in severe COVID-19 patients

External validation of the Revised Cardiac Risk Index and National Surgical Quality Improvement Program Myocardial Infarction and Cardiac Arrest calculator in noncardiac vascular surgery

Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study

SURvival PRediction In SEverely Ill Patients Study-The Prediction of Survival in Critically Ill Patients by ICU Physicians. Critical Care Explorations

Scores to predict major bleeding risk during oral anticoagulation therapy: A prospective validation study

Diagnostic accuracy of physician's gestalt in suspected COVID-19: Prospective bicentric study

The accuracy of clinicians' predictions of survival in advanced cancer: A review. Annals of Palliative Medicine

Neutrophil-to-Lymphocyte Ratio Predicts Severe Illness Patients with 2019 Novel Coronavirus in the Early Stage

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet

An interpretable mortality prediction model for COVID-19 patients

EasyROC: An interactive web-tool for ROC curve analysis using R language environment

The authors report no conflict of interests.
