title: Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: An observational cohort study
authors: Gupta, Rishi K.; Marks, Michael; Samuels, Thomas H. A.; Luintel, Akish; Rampling, Tommy; Chowdhury, Humayra; Quartagno, Matteo; Nair, Arjun; Lipman, Marc; Abubakar, Ibrahim; van Smeden, Maarten; Wong, Wai Keong; Williams, Bryan; Noursadeghi, Mahdad
date: 2020-09-25
journal: Eur Respir J
DOI: 10.1183/13993003.03498-2020

BACKGROUND: The number of proposed prognostic models for COVID-19 is growing rapidly, but it is unknown whether any are suitable for widespread clinical implementation.

METHODS: We independently externally validated the performance of candidate prognostic models, identified through a living systematic review, among consecutive adults admitted to hospital with a final diagnosis of COVID-19. We reconstructed candidate models as per original descriptions and evaluated performance for their original intended outcomes using predictors measured at admission. We assessed discrimination, calibration and net benefit, compared to the default strategies of treating all and no patients, and against the most discriminating predictor in univariable analyses.

RESULTS: We tested 22 candidate prognostic models among 411 participants with COVID-19, of whom 180 (43.8%) and 115 (28.0%) met the endpoints of clinical deterioration and mortality, respectively. The highest areas under the receiver operating characteristic curve (AUROC) were achieved by the NEWS2 score for prediction of deterioration over 24 h (0.78; 95% CI 0.73-0.83), and by a novel model for prediction of deterioration <14 days from admission (0.78; 0.74-0.82). The most discriminating univariable predictors were admission oxygen saturation on room air for in-hospital deterioration (AUROC 0.76; 0.71-0.81), and age for in-hospital mortality (AUROC 0.76; 0.71-0.81). No prognostic model demonstrated consistently higher net benefit than these univariable predictors across a range of threshold probabilities.

CONCLUSIONS: Admission oxygen saturation on room air and patient age are strong predictors of deterioration and mortality, respectively, among hospitalised adults with COVID-19. None of the prognostic models evaluated here offered incremental value for patient stratification beyond these univariable predictors.

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), causes a spectrum of disease ranging from asymptomatic infection to critical illness. Among people admitted to hospital, COVID-19 has a reported mortality of 21-33%, with 14-17% requiring admission to high dependency or intensive care units (ICU) [1-4]. Exponential surges in transmission of SARS-CoV-2, coupled with the severity of disease among a subset of those affected, pose major challenges to health services by threatening to overwhelm resource capacity [5]. Rapid and effective triage at the point of presentation to hospital is therefore required to facilitate adequate allocation of resources and to ensure that patients at higher risk of deterioration are managed and monitored appropriately. Importantly, prognostic models may have additional value in patient stratification for emerging drug therapies [6, 7]. As a result, there has been global interest in the development of prediction models for COVID-19 [8].
These include models aiming to predict a diagnosis of COVID-19, and prognostic models aiming to predict disease outcomes. At the time of writing, a living systematic review had already catalogued 145 diagnostic or prognostic models for COVID-19 [8]. Critical appraisal of these models, using quality assessment tools developed specifically for prediction modelling studies, suggests that the candidate models are poorly reported, at high risk of bias, and prone to over-estimation of their reported performance [8, 9]. However, independent evaluation of candidate prognostic models in unselected datasets has been lacking. It therefore remains unclear how well these proposed models perform in practice, or whether any are suitable for widespread clinical implementation. We aimed to address this knowledge gap by systematically evaluating the performance of proposed prognostic models among consecutive patients hospitalised with a final diagnosis of COVID-19 at a single centre, using predictors measured at the point of hospital admission.

We used a published living systematic review to identify all candidate prognostic models for COVID-19 indexed in PubMed, Embase, Arxiv, medRxiv, or bioRxiv until 5th May 2020, regardless of underlying study quality [8]. We included models that aim to predict clinical deterioration or mortality among patients with COVID-19. We also included prognostic scores commonly used in clinical practice [10-12], but not specifically developed for COVID-19 patients, since these models may also be considered for use by clinicians to aid risk-stratification for patients with COVID-19. For each candidate model identified, we extracted predictor variables, outcome definitions (including time horizons), modelling approaches, and final model parameters from original publications, and contacted authors for additional information where required. We excluded scores where the underlying model parameters were not publicly available, since we were unable to reconstruct them, along with models for which included predictors were not available in our dataset. The latter included models that require computed tomography imaging or arterial blood gas sampling, since these investigations were not routinely performed among unselected patients with COVID-19 at our centre. Our study is reported in accordance with transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidance for external validation studies [13].

Data were collected by direct extraction from electronic health records, complemented by manual curation. Variables of interest in the dataset included: demographics (age, gender, ethnicity), comorbidities (identified through manual record review), clinical observations, laboratory measurements, radiology reports, and clinical outcomes. Each chest radiograph was reported by a single radiologist, who was provided with a short summary of the indication for the investigation at the time of request, reflecting routine clinical conditions. Chest radiographs were classified using British Society of Thoracic Imaging criteria, and using a modified version of the Radiographic Assessment of Lung Edema (RALE) score [14, 15]. For each predictor, measurements were recorded as part of routine clinical care. Where serial measurements were available, we included the measurement taken closest to the time of presentation to hospital, with a maximum interval between presentation and measurement of 24 hours.
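To make this selection rule concrete, the following is a minimal sketch in R (the language used for all analyses in the study). The table `obs` and its columns `patient_id`, `predictor`, `measured_at` and `admission_time` are hypothetical names for illustration, not the study dataset.

```r
library(dplyr)

# Hypothetical long-format table: one row per patient, predictor and
# timestamp, with the patient's admission time already joined on.
closest_obs <- obs %>%
  mutate(interval_hrs = abs(as.numeric(difftime(measured_at, admission_time,
                                                units = "hours")))) %>%
  filter(interval_hrs <= 24) %>%                         # <= 24 h from presentation
  group_by(patient_id, predictor) %>%
  slice_min(interval_hrs, n = 1, with_ties = FALSE) %>%  # closest measurement
  ungroup()
```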
For models that used ICU admission or death, or progression to 'severe' COVID-19 or death, as composite endpoints, we used a composite 'clinical deterioration' endpoint as the primary outcome. We defined clinical deterioration as initiation of ventilatory support (continuous positive airway pressure, non-invasive ventilation, high flow nasal cannula oxygen, invasive mechanical ventilation or extra-corporeal membrane oxygenation) or death, equivalent to World Health Organization Clinical Progression Scale ≥6 [16]. This definition does not include standard oxygen therapy. We did not apply any temporal limits on (a) the minimum duration of respiratory support; or (b) the interval between presentation to hospital and the outcome. The rationale for this composite outcome is to make the endpoint more generalisable between centres, since hospital respiratory management algorithms may vary substantially. Defining the outcome based on level of support, as opposed to ward setting, also ensures that it is appropriate in the context of a pandemic, when treatments that would usually only be considered in an ICU setting may be administered in other environments due to resource constraints.

Where models specified their intended time horizon in their original description, we used this timepoint in the primary analysis, in order to ensure unbiased assessment of model calibration. Where the intended time horizon was not specified, we assessed the model to predict in-hospital deterioration or mortality, as appropriate. All deterioration and mortality events were included, regardless of their clinical aetiology. Participants were followed up clinically to the point of discharge from hospital. We extended follow-up beyond discharge by cross-checking NHS spine records to identify reported deaths post-discharge, thus ensuring >30 days' follow-up for all participants.

For each prognostic model included in the analyses, we reconstructed the model according to authors' original descriptions, and sought to evaluate the model's discrimination and calibration performance against our approximation of their original intended endpoint. For models that provide online risk calculator tools, we validated our reconstructed models against the original authors' models, by cross-checking our predictions against those generated by the web-based tools for a random subset of participants. For all models, we assessed discrimination by quantifying the area under the receiver operating characteristic curve (AUROC) [17]. For models that provided outcome probability scores, we assessed calibration by visualising calibration of predicted vs. observed risk using loess-smoothed plots, and by quantifying calibration slopes and calibration-in-the-large (CITL). Where no model intercept was available, we calibrated the intercept to the validation dataset by using the model linear predictor as an offset term; this precluded assessment of CITL, but allowed us to examine the calibration slope in our dataset.

We also assessed the discrimination of each candidate model for standardised outcomes of: (a) our composite endpoint of clinical deterioration; and (b) mortality, across a range of pre-specified time horizons from admission (7 days, 14 days, 30 days and any time during hospital admission), by calculating time-dependent AUROCs (with cumulative sensitivity and dynamic specificity) [18]. The rationale for this analysis was to harmonise endpoints, in order to facilitate more direct comparisons of discrimination between the candidate models.
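The calibration metrics described above can be obtained with standard logistic regression. The following is a minimal sketch (not the authors' code), assuming `lp` holds a candidate model's linear predictor on the log-odds scale and `y` the observed binary outcome; both names are illustrative.

```r
# Calibration slope: coefficient when the outcome is refitted on the linear
# predictor (1 = ideal; <1 suggests the original model was overfitted).
cal_slope <- coef(glm(y ~ lp, family = binomial))["lp"]

# Calibration-in-the-large: intercept with the linear predictor held fixed
# as an offset (0 = ideal; negative = systematic over-estimation of risk).
citl <- coef(glm(y ~ 1 + offset(lp), family = binomial))["(Intercept)"]

# Loess-smoothed calibration curve of predicted vs. observed risk.
pred <- plogis(lp)                    # predicted probabilities
fit  <- loess(y ~ pred)               # smoothed observed risk
ord  <- order(pred)
plot(pred[ord], fitted(fit)[ord], type = "l",
     xlab = "Predicted risk", ylab = "Observed risk")
abline(0, 1, lty = 2)                 # line of perfect calibration

# Time-dependent AUROC (cumulative sensitivity / dynamic specificity) can be
# computed, for example, with the timeROC package (an assumption; the paper
# does not name its tool):
# timeROC::timeROC(T = time, delta = event, marker = lp, cause = 1,
#                  times = c(7, 14, 30), weighting = "marginal")
```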
In order to further benchmark the performance of candidate prognostic models, we then computed AUROCs for a limited number of univariable predictors considered to be of highest importance a priori, based on clinical knowledge and existing data, for prediction of our composite endpoints of clinical deterioration and mortality (7 days, 14 days, 30 days and any time during hospital admission). The a priori predictors of interest examined in this analysis were age, clinical frailty scale, oxygen saturation at presentation on room air, C-reactive protein and absolute lymphocyte count [8, 19].

Decision curve analysis allows assessment of the clinical utility of candidate models, and is dependent on both model discrimination and calibration [20]. We performed decision curve analyses to quantify the net benefit achieved by each model for predicting the intended endpoint, in order to inform clinical decision making across a range of risk:benefit ratios for an intervention or 'treatment' [20]. In this approach, the risk:benefit ratio is analogous to the cut point for a statistical model above which the intervention would be considered beneficial (deemed the 'threshold probability'). Net benefit was calculated as:

net benefit = (sensitivity × prevalence) - (1 - specificity) × (1 - prevalence) × w

where w is the odds at the threshold probability p_t (i.e. w = p_t / (1 - p_t)) and the prevalence is the proportion of patients who experienced the outcome [20]. We calculated net benefit across a range of clinically relevant threshold probabilities, ranging from 0 to 0.5, since the risk:benefit ratio may vary for any given intervention (or 'treatment'). We compared the utility of each candidate model against strategies of treating all and no patients, and against the best performing univariable predictor for in-hospital clinical deterioration, or mortality, as appropriate. To ensure that fair, head-to-head net benefit comparisons were made between multivariable probability-based models, points score models and univariable predictors, we calibrated each of these to the validation dataset for the purpose of decision curve analysis. Probability-based models were recalibrated to the validation data by refitting logistic regression models with the candidate model linear predictor as the sole predictor. We calculated 'delta' net benefit as net benefit when using the index model minus net benefit when: (a) treating all patients; and (b) using the most discriminating univariable predictor. Decision curve analyses were done using the rmda package in R [21].

We handled missing data using multiple imputation by chained equations [22], using the mice package in R [23]. All variables and outcomes in the final prognostic models were included in the imputation model to ensure compatibility [22]. A total of 10 imputed datasets were generated; discrimination, calibration and net benefit metrics were pooled using Rubin's rules [24]. All analyses were conducted in R (version 3.5.1).

We recalculated discrimination and calibration parameters for each candidate model using: (a) a complete case analysis (in view of the large amount of missingness for some models); (b) excluding patients without PCR-confirmed SARS-CoV-2 infection; and (c) excluding patients who met the clinical deterioration outcome within 4 hours of arrival to hospital. We also examined for non-linearity in the a priori univariable predictors using restricted cubic splines, with 3 knots.
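As an illustration of the net benefit calculation above, the following is a minimal sketch implementing the formula directly (the published analysis used the rmda package rather than this code). The vectors `p` (predicted probabilities, recalibrated to the validation data) and `y` (observed 0/1 outcome) are illustrative names.

```r
# Net benefit at threshold probability pt.
net_benefit <- function(p, y, pt) {
  w    <- pt / (1 - pt)                        # odds at the threshold probability
  prev <- mean(y)                              # outcome prevalence
  sens <- mean(p >= pt & y == 1) / prev        # sensitivity at cut-point pt
  spec <- mean(p <  pt & y == 0) / (1 - prev)  # specificity at cut-point pt
  sens * prev - (1 - spec) * (1 - prev) * w
}

# Evaluate across the clinically relevant threshold range, against the
# default strategies of treating all patients and treating none.
thresholds <- seq(0.01, 0.50, by = 0.01)
nb_model   <- sapply(thresholds, function(pt) net_benefit(p, y, pt))
nb_all     <- mean(y) - (1 - mean(y)) * thresholds / (1 - thresholds)
nb_none    <- 0
# With multiply imputed data, the pooled point estimate under Rubin's rules
# is simply the mean of nb_model across the imputed datasets.
```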
Finally, we estimated optimism for discrimination and calibration parameters for the a priori univariable predictors using bootstrapping (1,000 iterations), using the rms package in R [25]. The pre-specified study protocol was approved by East Midlands - Nottingham 2 Research Ethics Committee (REF: 20/EM/0114; IRAS: 282900).

We identified a total of 37 studies describing prognostic models, of which 19 studies (including 22 unique models) were eligible for inclusion (Supplementary Figure 1 and Table 1). Of these, 5 models were not specific to COVID-19, but were developed as prognostic scores for emergency department attendees [26], hospitalised patients [12, 27], people with suspected infection [10] or community-acquired pneumonia [11], respectively. Of the 17 models developed specifically for COVID-19, most (10/17) were developed using datasets originating in China. Overall, discovery populations included hospitalised patients and were similar to the current validation population, with the exception of one study that discovered a model using community data [28], and another that used simulated data [29]. A total of 13/22 models use points-based scoring systems to derive final model scores, with the remainder using logistic regression modelling approaches to derive probability estimates. A total of 12/22 prognostic models primarily aimed to predict clinical deterioration, while the remaining 10 sought to predict mortality alone. When specified, time horizons for prognosis ranged from 1 to 30 days. Candidate prognostic models not included in the current validation study are summarised in Supplementary Table 1.

During the study period, 521 adults were admitted with a final diagnosis of COVID-19, of whom 411 met the eligibility criteria for inclusion (flowchart shown in Supplementary Figure 2). Median age of the cohort was 66 years (interquartile range (IQR) 53-79), and the majority were male (252/411; 61.3%).

For all models that provide probability scores for either deterioration or mortality, calibration appeared visually poor, with evidence of overfitting and either systematic overestimation or underestimation of risk (Figure 1). Supplementary Figure 5 shows associations between prognostic models with points-based scores and actual risk. In addition to demonstrating reasonable discrimination, the NEWS2 and CURB65 models demonstrated approximately linear associations between scores and actual probability of deterioration at 24 hours and mortality at 30 days, respectively.

Next, we sought to compare the discrimination of these models for both clinical deterioration and mortality across the range of time horizons, benchmarked against preselected univariable predictors associated with adverse outcomes in COVID-19 [8, 19]. We recalculated time-dependent AUROCs for each of these outcomes, stratified by time horizon to the outcome (Supplementary Figures 6 and 7). These analyses showed that AUROCs generally declined with increasing time horizons. Admission oxygen saturation on room air was the strongest predictor of in-hospital deterioration (AUROC 0.76; 95% CI 0.71-0.81), while age was the strongest predictor of in-hospital mortality (AUROC 0.76; 95% CI 0.71-0.81).

We compared net benefit for each prognostic model (for its original intended endpoint) to the strategies of treating all patients, treating no patients, and using the most discriminating univariable predictor for either deterioration (i.e. oxygen saturation on air) or mortality (i.e.
patient age) to stratify treatment (Supplementary Figure 8). Although all prognostic models showed greater net benefit than treating all patients at the higher range of threshold probabilities, none of these models demonstrated consistently greater net benefit than the most discriminating univariable predictor across the range of threshold probabilities (Figure 2). Recalculation of discrimination and calibration parameters in the pre-specified sensitivity analyses yielded similar results (Supplementary Figure 9). Finally, internal validation using bootstrapping showed near-zero optimism for discrimination and calibration parameters for the univariable models (Supplementary Table).

In this observational cohort study of consecutive adults hospitalised with COVID-19, we systematically evaluated the performance of 22 prognostic models for COVID-19. These included models developed specifically for COVID-19, along with existing scores in routine clinical use prior to the pandemic. For prediction of clinical deterioration or mortality, AUROCs ranged from 0.56 to 0.78. NEWS2 performed reasonably well for prediction of deterioration over a 24-hour interval, achieving an AUROC of 0.78, while the Carr 'final' model [31] also had an AUROC of 0.78, but tended to systematically underestimate risk. All COVID-specific models that derived an outcome probability of either deterioration or mortality showed poor calibration. We found that oxygen saturation (AUROC 0.76) and patient age (AUROC 0.76) were the most discriminating single variables for prediction of in-hospital deterioration and mortality, respectively. These predictors have the added advantage that they are immediately available at the point of presentation to hospital. In decision curve analysis, which is dependent upon both model discrimination and calibration, no prognostic model demonstrated clinical utility consistently greater than using these univariable predictors to inform decision-making.

While previous studies have largely focused on novel model discovery, or on evaluation of a limited number of existing models, this is, to our knowledge, the first study to evaluate systematically identified candidate prognostic models for COVID-19. We used a comprehensive living systematic review [8] to identify eligible models and sought to reconstruct each model as per the original authors' description. We then evaluated performance against its intended outcome and time horizon, wherever possible, using recommended methods of external validation incorporating assessments of discrimination, calibration and net benefit [17]. Moreover, we used a robust approach of electronic health record data capture, supported by manual curation, in order to ensure a high-quality dataset, and inclusion of unselected and consecutive COVID-19 cases that met our eligibility criteria. In addition, we used robust outcome measures of mortality and clinical deterioration, aligning with the WHO Clinical Progression Scale [16].

The current study has weaknesses. First, it is based on retrospective data from a single centre, and therefore cannot assess between-setting heterogeneity in model performance. Second, due to the limitations of routinely collected data, predictor variables were available for varying numbers of participants for each model, with a large proportion of missingness for models requiring lactate dehydrogenase and D-dimer measurements. We therefore performed multiple imputation, in keeping with recommendations for development and validation of multivariable prediction models, in our primary analyses [32].
Findings were similar in the complete case sensitivity analysis, thus supporting the robustness of our results. Future studies would benefit from standardising data capture and laboratory measurements prospectively to minimise predictor missingness. Third, a number of models could not be reconstructed in our data. For some models, this was due to the absence of predictors in our dataset, such as those requiring computed tomography imaging, since this is not currently routinely recommended for patients with suspected or confirmed COVID-19 [15]. We were also not able to include models for which the parameters were not publicly available. This underscores the need for strict adherence to reporting standards in multivariable prediction models [13]. Finally, we used admission data only as predictors in this study, since most prognostic scores are intended to predict outcomes at the point of hospital admission. We note, however, that some scores are designed for dynamic in-patient monitoring, with NEWS2 showing reasonable discrimination for deterioration over a 24-hour interval, as originally intended [27]. Future studies may integrate serial data to examine model performance when using such dynamic measurements.

Despite the vast global interest in the pursuit of prognostic models for COVID-19, our findings show that none of the COVID-19-specific models evaluated in this study can currently be recommended for routine clinical use. In addition, while some of the evaluated models that are not specific to COVID-19 are routinely used and may be of value among in-patients [12, 27], people with suspected infection [10] or community-acquired pneumonia [11], none showed greater clinical utility than the strongest univariable predictors among patients with COVID-19. Our data show that admission oxygen saturation on air is a strong predictor of clinical deterioration and may be evaluated in future studies to stratify in-patient management and for remote community monitoring.

We note that all novel prognostic models for COVID-19 assessed in the current study were derived from single-centre data. Future studies may seek to pool data from multiple centres in order to robustly evaluate the performance of existing and newly emerging models across heterogeneous populations, and to develop and validate novel prognostic models, through individual participant data meta-analysis [33]. Such an approach would allow assessments of between-study heterogeneity and the likely generalisability of candidate models. It is also imperative that discovery populations are representative of target populations for model implementation, with inclusion of unselected cohorts. Moreover, we strongly advocate for transparent reporting in keeping with TRIPOD standards (including modelling approaches, all coefficients and standard errors), along with standardisation of outcomes and time horizons, in order to facilitate ongoing systematic evaluations of model performance and clinical utility [13].

We conclude that baseline oxygen saturation on room air and patient age are strong predictors of deterioration and mortality, respectively. None of the prognostic models evaluated in this study offers incremental value for patient stratification beyond these univariable predictors when using admission data. Therefore, none of the evaluated prognostic models for COVID-19 can be recommended for routine clinical implementation.
Future studies seeking to develop prognostic models for COVID-19 should consider integrating multi-centre data in order to increase the generalisability of findings, and should ensure benchmarking against existing models and simpler univariable predictors.

Abbreviations: MEWS = modified early warning score; qSOFA = quick sequential (sepsis-related) organ failure assessment; REMS = rapid emergency medicine score; NEWS = national early warning score; TACTIC = therapeutic study in pre-ICU patients admitted with COVID-19; AVPU = alert / responds to voice / responsive to pain / unresponsive; CRP = C-reactive protein; LDH = lactate dehydrogenase; RALE = radiographic assessment of lung edema; ARDS = acute respiratory distress syndrome; ICU = intensive care unit; ECMO = extra-corporeal membrane oxygenation.

Units, unless otherwise specified, are: age in years; respiratory rate in breaths per minute; heart rate in beats per minute; blood pressure in mmHg; temperature in °C; oxygen saturation in %; CRP in mg/L; LDH in U/L; neutrophils, lymphocytes, total white cell count and platelets in ×10^9/L; D-dimer in ng/mL; creatinine in μmol/L; estimated glomerular filtration rate in mL/min/1.73 m²; albumin in g/L. ^Clinician-defined obesity.

For each model, performance is evaluated for its original intended outcome, shown in the 'Primary outcome' column. AUROC = area under the receiver operating characteristic curve; CI = confidence interval.

Figure 1 legend: For each plot, the blue line represents a loess-smoothed calibration curve from the stacked multiply imputed datasets, and rug plots indicate the distribution of data points. No model intercept was available for the Caramelo or Colombi 'clinical' models; the intercepts for these models were calibrated to the validation dataset by using the model linear predictors as offset terms. The primary outcome of interest for each model is shown in the plot sub-heading.

Figure 2 legend: For each analysis, the endpoint is the original intended outcome and time horizon for the index model. Each candidate model and univariable predictor was calibrated to the validation data during analysis to enable fair, head-to-head comparisons. Delta net benefit is calculated as net benefit when using the index model minus net benefit when: (1) treating all patients; and (2) using the most discriminating univariable predictor. The most discriminating univariable predictor is admission oxygen saturation (SpO2) on room air for deterioration models and patient age for mortality models. Delta net benefit is shown with loess smoothing. The black dashed line indicates the threshold above which the index model has greater net benefit than the comparator. Individual decision curves for each candidate model are shown in Supplementary Figure 8.

For each model, performance is evaluated for an approximation of its original intended outcome, shown in the 'Primary outcome' column. AUROC = area under the receiver operating characteristic curve; CI = confidence interval.
Table: performance of candidate models for their original intended outcomes (columns: Score; Primary outcome; AUROC (95% CI); Calibration slope (95% CI); Calibration-in-the-large (95% CI)).

References

[1] Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area
[2] Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study
[3] Critical Care Utilization for the COVID-19 Outbreak in Lombardy, Italy
[4] Report 17 - Clinical characteristics and predictors of outcomes of hospitalised patients with COVID-19 in a London NHS Trust: a retrospective cohort study. Imperial College London
[5] The demand for inpatient and ICU beds for COVID-19 in the US: lessons from Chinese cities
[6] Remdesivir for the Treatment of Covid-19 - Preliminary Report
[7] Effect of Dexamethasone in Hospitalized Patients with COVID-19: Preliminary Report
[8] Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
[9] PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies
[10] Assessment of Clinical Criteria for Sepsis
[11] Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study
[12] Royal College of Physicians. National Early Warning Score (NEWS) 2. RCP London
[13] Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement
[14] Frequency and Distribution of Chest Radiographic Findings in COVID-19 Positive Patients
[15] COVID-19 Resources. The British Society of Thoracic Imaging
[16] A minimal common outcome measure set for COVID-19 clinical research
[17] Prognosis research in healthcare: concepts, methods, and impact
[18] Time-dependent ROC curve analysis in medical research: current methods and applications
[19] The effect of frailty on survival in patients with COVID-19 (COPE): a multicentre, European, observational cohort study
[20] A simple, step-by-step guide to interpreting decision curve analysis
[21] rmda: Risk Model Decision Analysis
[22] Multiple imputation using chained equations: Issues and guidance for practice
[23] mice: Multivariate Imputation by Chained Equations in R
[24] Multiple imputation for nonresponse in surveys
[25] rms: Regression Modeling Strategies
[26] Rapid Emergency Medicine Score: A New Prognostic Tool for In-Hospital Mortality in Nonsurgical Emergency Department Patients
[27] The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death
[28] Predicting mortality due to SARS-CoV-2: A mechanistic score relating obesity and diabetes to COVID-19 outcomes in Mexico
[29] Estimation of risk factors for COVID-19 mortality - preliminary results
[30] Sample size considerations for the external validation of a multivariable prognostic model: a resampling study
[31] Evaluation and Improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study
[32] Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration
[33] Cochrane IPD Meta-analysis Methods group. Individual Participant Data (IPD) Meta-analyses of Diagnostic and Prognostic Modeling Studies: Guidance on Their Use
Supplementary references

Validation of a modified Early Warning Score in medical admissions
Well-aerated Lung on Admitting Chest CT to Predict Adverse Outcome in COVID-19 Pneumonia
A clinical risk score to identify patients with COVID-19 at high risk of critical care admission or death: An observational cohort study
Development and validation of an early warning score (EWAS) for predicting clinical deterioration in patients with coronavirus disease 2019
Early prediction of mortality risk among severe COVID-19 patients using machine learning
Prognostic factors for COVID-19 pneumonia progression to severe symptom based on the earlier clinical features: a retrospective analysis
Prediction for Progression Risk in Patients with COVID-19 Pneumonia: the CALL Score
ACP risk grade: a simple mortality index for patients with confirmed or suspected severe acute respiratory syndrome coronavirus 2 disease (COVID-19) during the early stage of outbreak in Wuhan, China
Host susceptibility to severe COVID-19 and establishment of a host risk score: findings of 487 cases outside Wuhan
Development and external validation of a prognostic multivariable model on admission for hospitalized patients with COVID-19
An interpretable mortality prediction model for COVID-19 patients
Risk prediction for poor outcome and death in hospital inpatients with COVID-19: derivation in Wuhan, China and external validation in London
Comparing Rapid Scoring Systems in Mortality Prediction of Critically Ill Patients With Novel Coronavirus Disease
Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
Predicting COVID-19 malignant progression with AI techniques
Performing risk stratification for COVID-19 when individual level data is not available, the experience of a large healthcare organization
Holistic AI-Driven Quantification, Staging and Prognosis of COVID-19 Pneumonia
Predicting community mortality risk due to CoVID-19 using machine learning and development of a prediction tool
A Tool for Early Prediction of Severe Coronavirus Disease 2019 (COVID-19): A Multicenter Study Using the Risk Nomogram in Wuhan and Guangdong, China
Towards an Artificial Intelligence Framework for Data-Driven Prediction of Coronavirus Clinical Severity
Development and Validation of a Survival Calculator for Hospitalized Patients with COVID-19
Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19 (for the China Medical Treatment Expert Group [CMTEG] for COVID-19)
Prediction of the clinical outcome of COVID-19 patients using T lymphocyte subsets with 340 cases from Wuhan, China: a retrospective cohort study and a web visualization tool
Clinical Decision Support Tool and Rapid Point-of-Care Platform for Determining Disease Severity in Patients with COVID-19
Predicting Mortality Risk in Patients with COVID-19 Using Artificial Intelligence to Help Medical Decision-Making
Machine learning-based CT radiomics model for predicting hospital stay in patients with pneumonia associated with SARS-CoV-2 infection: A multicenter study
A Machine Learning Model Reveals Older Age and Delayed Hospitalization as Predictors of Mortality in Patients with COVID-19
Evaluating a Widely Implemented Proprietary Deterioration Index Model Among Hospitalized COVID-19 Patients
Machine Learning to Predict Mortality and Critical Events in COVID-19 Positive
TOWARD A COVID-19 SCORE-RISK ASSESSMENTS AND REGISTRY
Association of radiologic findings with mortality of patients infected with 2019 novel coronavirus in Wuhan
Risk assessment of progression to severe conditions for patients with COVID-19 pneumonia: a single-center retrospective study

Supplementary table footnotes:
$ No model intercept was available; the intercepts for these models were therefore calibrated to the validation dataset, using the model linear predictors as offset terms.
^ Using oxygen scale 1 for all participants, except for those with target oxygen saturation ranges of 88-92% (e.g. in hypercapnic respiratory failure), when scale 2 is used, as recommended [12].
All candidate models included in a living systematic review were considered at high risk of bias [1].
ARDS = acute respiratory distress syndrome; ICU = intensive care unit; CT = computed tomography.
Optimism is calculated using bootstrapping with 1,000 iterations. AUROC = area under the receiver operating characteristic curve; CI = confidence interval. Dxy = Somers' D, a measure of agreement between pairs of ordinal variables, ranging from -1 (no agreement) to +1 (complete agreement).