key: cord-1032859-nb64p7v9 authors: Werfel, S.; Jakob, C. E. M.; Borgmann, S.; Schneider, J.; Spinner, C.; Schons, M.; Hower, M.; Wille, K.; Haselberger, M.; Heuzeroth, H.; Ruethrich, M.; Dolff, S.; Kessel, J.; Heemann, U.; Vehreschild, J. J.; Rieg, S.; Schmaderer, C. title: Development and Validation of a Simplified Risk Score for the Prediction of Critical COVID-19 Illness in Newly Diagnosed Patients date: 2021-02-09 journal: nan DOI: 10.1101/2021.02.07.21251260 sha: a853157fe361ff8dd86ee28db742d13a762557ca doc_id: 1032859 cord_uid: nb64p7v9 Scores for identifying patients at high risk of progression of the coronavirus disease 2019 (COVID-19), caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), are discussed as key instruments for clinical decision-making and patient management during the current pandemic. Here we used the patient data from the multicenter Lean European Open Survey on SARS-CoV-2-Infected Patients (LEOSS) and applied a variable selection technique to develop a simplified score to identify patients at increased risk of critical illness or death. A total of 1,946 patients who tested positive for SARS-CoV-2 were included in the initial analysis. They were split into a derivation and a validation cohort (n=1,297 and 649, respectively). A stability selection among a total of 105 baseline predictors for the combined endpoint of progression to critical phase or COVID-19-related death allowed us to develop a simplified score consisting of five predictors: CRP, Age, clinical disease phase (uncomplicated vs. complicated), serum urea and D-dimer (abbreviated as the CAPS-D score). This score showed an AUC of 0.81 (95% CI: 0.77-0.85) in the validation cohort for predicting the combined endpoint within 7 days of diagnosis and 0.81 (95% CI: 0.77-0.85) during the full follow-up.
Finally, we used an additional prospective cohort of 682 patients, who were largely diagnosed after the "first wave" of the pandemic, to validate the predictive accuracy of the score, observing similar results (AUC for an event within 7 days: 0.83, 95% CI, 0.78-0.87; for the full follow-up: 0.82, 95% CI, 0.78-0.86). We thus successfully establish and validate an easily applicable score for calculating the risk of COVID-19 progression to critical illness or death. The first human cases of coronavirus disease 2019 (COVID-19) were described in December 2019 in Wuhan 1 . Since then, COVID-19 has developed into one of the most disastrous pandemics our civilization has experienced since the Spanish flu at the beginning of the 20th century 2, 3 . Exponential spread of the disease-causing Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), as occurred throughout Europe during the first wave of the pandemic, can result in excessive hospital overload and a shortage of healthcare resources, which may negatively affect patient outcomes 4 . This experience underpinned the importance of an effective process for allocating limited healthcare resources to the COVID-19 patients most likely to benefit from them. Consequently, to guarantee functional patient care, disease severity assessment for patients presenting to the emergency department (ED) may prove very helpful and guide frontline physicians in the decision-making process. On the one hand, a high number of patients deteriorate rapidly after hospital admission and require transfer to the intensive care unit (ICU); on the other hand, the clinical condition of other COVID-19 patients improves quite quickly. In this respect, a prediction model can support physicians in determining whether patients require hospital admission or can be followed up in outpatient care.
A risk assessment score may additionally prove to be a helpful tool for estimating the individual risk-benefit tradeoff of therapeutic interventions. The aim of the current study was to develop a simplified risk prediction model based on clinical and demographic characteristics and laboratory findings present at the time of diagnosis. Among the available baseline variables of the LEOSS dataset (≈170 predictors) we selected those with less than 50% missing values in the combined derivation and validation dataset (n=1,946 patients, as shown in Figure 2), with an exception made for troponin T (52% missing) and pancreatic lipase (56% missing). This resulted in a total of 105 predictors (as listed in Table 1 and Supplementary Table 1). Since the time-to-event data in the anonymized LEOSS cohort were grouped for patients experiencing an event at ≥8 days after study inclusion, the time variable was coded accordingly as 1 to 7 days and ≥8 days, resulting in 8 bins for the time variable (Supplementary Table 1). These bins were used for the time-to-event approaches (random survival forest and Cox models) and for C-index calculation. Continuous predictors are binned as value ranges in the LEOSS cohort due to anonymization, and for the analysis the ranges were coded as consecutively increasing integers. We performed unsupervised random forest missing value imputation using either the data of the combined derivation and validation datasets (n=1,946 patients) or, separately, the full test set (n=682 patients, Figure 2), while withholding the outcome variables. We thus generated 20 imputed datasets for each of the cohorts.
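The coding of binned laboratory ranges and of the grouped time variable can be illustrated with a small sketch (Python; the CRP range boundaries below are invented for illustration and are not the registry's actual bins):

```python
# Hypothetical sketch: encode LEOSS-style ordered value ranges as consecutive
# integers and bin time-to-event as days 1..7 plus a single ">=8 days" category.
# The range boundaries below are illustrative, NOT the registry's actual bins.

CRP_RANGES = ["<3 mg/L", "3-30 mg/L", "30-70 mg/L", "70-120 mg/L", ">=120 mg/L"]

def encode_ordinal(value, ordered_ranges):
    """Map an ordered categorical range to a consecutive integer (1-based)."""
    return ordered_ranges.index(value) + 1

def bin_time_to_event(days):
    """Days 1..7 keep their value; anything >=8 is collapsed into bin 8."""
    return min(max(days, 1), 8)

print(encode_ordinal("30-70 mg/L", CRP_RANGES))  # -> 3
print(bin_time_to_event(12))                     # -> 8
```

This ordinal coding treats each step between anonymized ranges as one unit, which is what allows the later regression coefficients to be read as "points per step".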
We performed a split into a derivation and a validation cohort with similar characteristics based on the following predefined potential confounders: age, sex, presence of dyspnoea, neutrophil count, lymphocyte count, lactate dehydrogenase (LDH), bilirubin, CRP, PCT, D-dimer, history of malignant neoplasia, presence of any CV comorbidity (as defined above) and the number of events. To this end we performed 1000 random splits at a 2/3 and 1/3 ratio, calculated the standardized mean difference for each split and variable, and selected the split with the smallest maximal standardized mean difference between these predictors. Variable selection was carried out using the Boruta algorithm 7 at 100 iterations, using equal proportions of the 20 imputed derivation datasets and a p-value of 0.01 for selection. For classification random forests we used the presence of an event (critical phase or COVID-19-related death) within 7 days of diagnosis as the outcome of interest during Boruta selection. We used the balanced method by Chen et al. 8 both during Boruta selection and modelling with the selected variables. Likewise, we used survival random forests as described by Ishwaran et al. 9 both during Boruta selection and during final modelling of the time-to-event data. For survival random forests, since they take time to event into account, events occurring more than 7 days after diagnosis were also included. Variable importance was calculated using permutation. For Cox and logistic (binomial) regression models we performed ridge (L2) penalization optimized using 20-fold cross-validation on the imputed derivation datasets. Score values were calculated from the ridge-penalized binomial regression coefficients of the model containing the five selected predictors, fitted on the derivation dataset with an event within 7 days as the outcome and with missing values replaced by the most common value across the 20 imputed datasets for the respective patient and predictor.
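The split-selection procedure described above (1000 random 2/3-1/3 splits, keeping the split with the smallest maximal standardized mean difference across the balancing variables) can be sketched as follows. This is a minimal illustration, not the authors' actual code, and the SMD formula shown (pooled-SD denominator) is one common convention:

```python
import random
import statistics

def smd(a, b):
    """Standardized mean difference of one variable between two groups."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    if pooled_sd == 0:
        return 0.0
    return abs(statistics.mean(a) - statistics.mean(b)) / pooled_sd

def best_split(rows, variables, n_splits=1000, frac=2/3, seed=0):
    """Try n_splits random derivation/validation splits at the given ratio;
    keep the split whose LARGEST SMD across the balancing variables is smallest."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    n_deriv = round(len(rows) * frac)
    best = None
    for _ in range(n_splits):
        rng.shuffle(idx)
        deriv, valid = idx[:n_deriv], idx[n_deriv:]
        worst = max(
            smd([rows[i][v] for i in deriv], [rows[i][v] for i in valid])
            for v in variables
        )
        if best is None or worst < best[0]:
            best = (worst, deriv[:], valid[:])
    return best  # (max SMD of the chosen split, derivation idx, validation idx)
```

Minimizing the worst-case SMD (rather than the average) guarantees that no single balancing variable is left badly imbalanced between the cohorts.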
Finally, the regression coefficients were divided by the smallest one and rounded to the nearest whole integer. Two-sided p-values for binomial ridge-penalized coefficients were obtained as suggested by Cule et al. 10 , by repeating the ridge regression procedure 1000 times on a dataset with randomly permuted outcomes (using equal amounts of the 20 imputed datasets). The area under the receiver operating characteristic curve (AUC) and (Harrell's) C-indices were calculated using linear predictors from the binomial and Cox ridge-penalized regression models or out-of-bag (OOB) predictor estimates for the random forest approaches. 95% confidence intervals for AUCs and C-indices were calculated using 1000 bootstraps of patients' scores with equal contributions of the imputed datasets. Important characteristics of the LEOSS cohort were described previously 5 . More diagnosed SARS-CoV-2 cases were available for the current analysis compared to this previous report (2,969 in the first dataset, comprising patients from the first wave of the pandemic, and 1,233 patients in a second test set, Figure 2) 5 . Based on the predefined disease phase (Figure 1) and the availability of laboratory values, a total of 1,946 patients were included in the first round of analysis and split into derivation and validation groups with similar characteristics (Figure 2). Important characteristics are summarized in Table 1, with a summary of the remaining predictors provided in Supplementary Table 1. The age distribution in the first dataset was centered, with approximately equal contributions of patients aged ≤65 and >65 years. There were more men than women (55-59% vs. 41-45%). At least 56% presented with a known CV comorbidity.
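The conversion of the ridge coefficients into integer score weights described in the methods above (dividing each coefficient by the smallest one and rounding) amounts to a few lines; the coefficient values below are purely illustrative and are not the published CAPS-D weights:

```python
# Hypothetical sketch: turn regression coefficients into integer score weights
# by dividing each beta by the smallest one and rounding to a whole number.
# The coefficient values below are illustrative, NOT the published CAPS-D betas.

def score_weights(coefficients):
    """Integer points per one-step increase in each predictor group."""
    smallest = min(coefficients.values())
    return {name: round(beta / smallest) for name, beta in coefficients.items()}

betas = {"CRP": 0.45, "age": 0.30, "phase": 0.62, "urea": 0.28, "D-dimer": 0.15}
print(score_weights(betas))
# -> {'CRP': 3, 'age': 2, 'phase': 4, 'urea': 2, 'D-dimer': 1}
```

Anchoring on the smallest coefficient makes the least influential predictor worth exactly one point, so the remaining weights stay small and easy to sum at the bedside.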
The incidence of the combined endpoint, critical phase or COVID-19-related death, within 7 days was 14-16%, and 20% when including any time point during the follow-up (Table 1). From the second test set (patients whose cases were entered into the registry after the first data export for score derivation), 682 patients met the selection criteria. This set largely consisted of patients diagnosed after June 2020 (Figure 2). Compared to the derivation/validation cohorts, these patients were younger (60% with an age of ≤65 years) and more were diagnosed in an uncomplicated phase (72% vs. 64-68%). Consequently, the event rate was lower, with only 10% experiencing an event within 7 days of diagnosis and 12% during the full follow-up (Table 1). Both the derivation and validation datasets consisted almost exclusively of patients receiving inpatient care. We performed Boruta variable stability selection using random forest for classification (RF), resulting in the selection of 5 (out of 105) predictors (Table 2). These were: CRP, disease phase, age, serum urea, and D-dimer (Supplementary Figure 1A). Interestingly, including only these five predictors in a logistic regression model achieved results almost on par with the full set of variables (Table 2, "RF Boruta", binomial ridge, median AUC=0.81 in the validation cohort). We additionally performed Boruta stability selection using a survival random forest (RSF) approach. Here, 24 predictors were retained, with the five predictors from RF Boruta being among the variables with the highest importance (Supplementary Figure 1B). Increasing the number of predictors to 24 had only a minor impact on model performance in the validation dataset compared to five predictors, as measured by Harrell's C-index (median C-index of 0.77 vs. 0.76 for five predictors, Supplementary Table 2).
Based on the encouraging results and the simple interpretability, we used the coefficients obtained in the binomial ridge regression model with five predictors (Table 3) to derive an additive score for the prediction of COVID-19 progression to critical phase or death. The score is outlined in Table 4. It showed a similar performance when compared to the binomial model both in the derivation and the validation datasets (median AUC in the validation dataset for events within 7 days of diagnosis: 0.81, 95% confidence interval (95% CI), 0.77-0.85, and for all events, 0.81, 95% CI, 0.77-0.85, Table 2). Interestingly, the simplified score also showed a similar performance to a Cox regression or an RSF approach with both the five and 24 predictors, as measured by Harrell's C-index (median C-index of 0.76, 95% CI, 0.73-0.80 in the validation cohort, Supplementary Table 2). As an independent prospective validation group, we used the second test set of patients whose data were entered into the registry after the initial data export (n=682 patients, "full test set" in Figure 2). To further reduce the impact of missing values on the estimation of score performance, we additionally removed patients from centers with >20% missing values for D-dimer, the variable with the most missing values (42-47% missing). Centers which enrolled fewer than five patients were also removed. This resulted in an additional "limited test set" (n=219 patients, Figure 2). This dataset had only a minor portion of missing values (CRP: 1%, serum urea: 2%, D-dimer: 7% missing, Table 2). Depending on the clinical application, different cutoffs may be considered.
We therefore provide the predictive metrics of the score, such as sensitivity, specificity and the positive and negative predictive values (PPV, NPV) vs. the cutoff (Figure 3), as well as the absolute event risks for specific score values (Supplementary Figure 2). In addition to the discriminative performance, we observed good calibration, with a slope ranging from 0.944 to 1.101 in the different validation/test datasets (Supplementary Figure 3). Interestingly, the Brier score tended to be smaller in the "full test" compared to the validation dataset (0.076-0.091 vs. 0.106-0.124, Supplementary Figure 3), mirroring the tendency towards a better discriminative performance in this dataset (Table 2 and Supplementary Table 2). Calibration-in-the-large for the "full test" set, which showed a lower event per case rate, was similar to that in the validation set for an event within 7 days (intercept of -0.181 vs. -0.182), with a larger difference for all events (intercept of -0.334 vs. 0.004, potentially reflecting the differences in the event rates between the cohorts). One method of selecting a cutoff is to optimize the modified Youden's J 11 . For the proposed score, the optimal J in the combined validation and full test dataset was at a cutoff of ≥17, both for predictions at 7 days after diagnosis and for all events. Applying this cutoff, on average 69% of patients are predicted not to progress to critical illness (Table 5, combined validation/test dataset), at an NPV of 95% for 7 days after diagnosis and an NPV of 94% for the full follow-up. Patients with scores at or above this threshold had ≈3-fold increased odds of experiencing an event, while patients below this threshold had ≈3-fold decreased odds, as measured by the respective likelihood ratios (Table 5). Here we describe the derivation and validation of a COVID-19 risk score for the prediction of the combined endpoint of critical disease or COVID-19-related death with five predictors.
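The cutoff metrics discussed above (sensitivity, specificity, PPV/NPV, likelihood ratios and Youden's J) all follow from one 2x2 table, as in the sketch below. Note that the authors used a modified Youden's J 11 ; the standard form is shown here for illustration:

```python
# Minimal sketch: classification metrics when "score >= cutoff" flags high risk.
# Assumes every denominator is nonzero (true for realistically mixed cohorts).

def cutoff_metrics(scores, events, cutoff):
    tp = sum(1 for s, e in zip(scores, events) if s >= cutoff and e)
    fp = sum(1 for s, e in zip(scores, events) if s >= cutoff and not e)
    fn = sum(1 for s, e in zip(scores, events) if s < cutoff and e)
    tn = sum(1 for s, e in zip(scores, events) if s < cutoff and not e)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PPV": tp / (tp + fp),          # positive predictive value
        "NPV": tn / (tn + fn),          # negative predictive value
        "LR+": sens / (1 - spec),       # positive likelihood ratio
        "LR-": (1 - sens) / spec,       # negative likelihood ratio
        "youden_J": sens + spec - 1,    # standard (unweighted) Youden's J
    }
```

Sweeping the cutoff and taking the value that maximizes J is the selection rule referenced in the text; PPV and NPV, unlike the likelihood ratios, additionally depend on the event prevalence of the cohort.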
We derive the score in an untargeted way by selecting the most stable predictors among the 105 available at baseline in the LEOSS registry using a random forest approach, and use regularized regression to calculate the coefficients. A number of approaches for COVID-19 risk stratification have been reported previously (reviewed by Wynants et al. 12 ), several with the similar aim of predicting critical disease, as indicated by admission to the ICU (e.g. 13-15) or death (e.g. 14-17). The availability of resources such as hospital or ICU beds has been limited during the high tide of the pandemic and the resulting strain on healthcare systems. Thus, difficulties may arise in generalizing outcome predictions obtained under these constraints in currently available scores. In our view some important limiting aspects have to be considered. On the one hand, if hospital beds are limited, the study population for inpatient analyses may overrepresent patients with symptoms of exceptional severity and high-risk groups, which may limit generalizability (as noted, e.g., for the ISARIC 4C score 16 ). Similarly, if ICU resources are limited, the indications for admission may be more conservative; thus a patient may be classified as having a favorable outcome (not being admitted to an ICU) despite having fulfilled clinical criteria at some point. This is exemplified by the finding that one of the predictors included in the COVID-GRAM score was unconsciousness, which may already indicate an outcome of advanced disease. Another important consideration is the generalizability of mortality as an outcome for patient stratification.
Case fatality rates have differed widely across countries 18 , even within the European Union, which may be at least in part attributed to country-specific differences in the clinical management of COVID-19 patients and to resource availability during the first wave of the pandemic 4 . This may limit generalizability and potentially require an update to existing scores for mortality prediction 19 as care providers gain experience with COVID-19 management and the strain on hospitals is reduced. A previous review of COVID-19 prognosis scores came to an overall negative assessment of the potential bias of these scores and as a result discouraged their use 12 . A combination of characteristics sets our approach apart from those available (to our knowledge) at the time of writing and potentially makes it more generalizable for future clinical application: (a) the outcome was not defined in terms of a specific treatment (or lack thereof, i.e. admission to the ICU), but rather based on clinical features (a predefined "critical phase"); (b) the inclusion was based on predefined clinical criteria ("uncomplicated" or "complicated" phase); and (c) the use of a stability selection approach as a means of reducing the number of predictors, as further discussed below. Additionally, the vast majority (>90%) of the cases enrolled into the LEOSS cohort were from Germany 5 , where the capacity of the healthcare system was in general not exceeded during the first wave of the pandemic 20 . To address bias in predictor selection we used an untargeted approach and resampling techniques (stability selection and cross-validated ridge regression) in order to first internally test the predictions on the derivation dataset and then validate them on a withheld validation cohort.
Stability selection helps to ensure internal validity and a sufficient sample size already for the derivation dataset, as a sample that is too small will reduce variable stability and lead to fewer variables being selected. Ridge regression shrinks the regression coefficients to achieve improved predictions in a binomial model, again with internal (cross-)validation already in the derivation dataset. As a result of the above steps, we could successfully confirm the performance of our score in an independent test set, consisting in its majority of COVID-19 cases diagnosed after the first wave of the pandemic. An important contributor to the predictive performance of the final score was the predefined clinical phase ("complicated" vs. "uncomplicated"), which summarizes the presence of manifest organ involvement of the lungs, heart or liver. Of note, some parameters of the complicated phase, such as arterial partial pressure of oxygen (PaO2) and pericardial effusion, were acquired by indication (e.g. if an echocardiography or arterial blood gas analysis was performed, but not routinely); therefore, for phase assignment these do not have to be taken into account in the absence of an indication for the respective measurement. Serum urea, likely as a measure of kidney involvement, was also an important predictor and outperformed creatinine, as reported previously for mortality 16, 21 . This predictor potentially summarizes both preexisting chronic kidney disease (CKD) as a risk factor (as e.g. reported by Williamson et al.
22 and in Supplementary Figure 1B) and acute kidney injury (AKI) due to COVID-19 as organ involvement (also stable in RSF Boruta, Supplementary Figure 1B). Different mechanisms of AKI in COVID-19 patients have been observed, including indirect involvement, e.g. due to a cardiorenal syndrome, direct virus-induced injury, as well as immunologic causes such as complement activation (as reviewed in 23, 24 ). Differentiating the type of acute kidney involvement in COVID-19 patients may provide further insights and refine risk stratification in future analyses. Overall, the presented score, despite being limited to only 5 predictors and applying a point system, compared well to more complex prediction models 14, 16 . We suggest a threshold for patients with an increased risk of critical disease at ≥17 points, based on the modified Youden's J. At this threshold we obtained a positive likelihood ratio of ≈3 while retaining a good negative predictive value of 94-95%. Different cutoffs may be considered based on the application and local circumstances (e.g. a different local ratio of critical disease per case, travelling time to the next hospital in case of deterioration in an outpatient setting, etc.). The graphs provided in Figure 3 for sensitivity/specificity and PPV/NPV (based on the prevalence in the validation and test datasets), as well as in Supplementary Figure 2 for absolute risk prediction, may assist in determining such thresholds. Our study has limitations. The LEOSS registry is anonymized and continuous parameters were categorized, thus potentially reducing the predictive performance of e.g. laboratory measures. As a real-world dataset, given the heterogeneity of clinical procedures across centers, our analysis had to compensate for missing values. This typically reduces the predictive performance of the respective variables and the probability that they pass stability selection criteria. Thus, some predictors may have been underestimated or missed.
Our analysis was limited to predicting disease progression with information obtained at the time point of the first positive SARS-CoV-2 test (typically occurring during presentation at the medical facility), without taking into account the dynamics of the predictors. In this regard, the number of days from the beginning of symptoms (uncomplicated phase) to diagnosis was included as a variable; however, it did not pass the stability criteria. Also, there were differences between the validation and the test dataset, with the latter having a higher proportion of patients diagnosed in the uncomplicated phase (suggesting earlier diagnosis, possibly due to expanded testing capacities after the first wave). Yet, the score still showed a similar, or even slightly better, performance in the test set. This indirect evidence suggests that the application of our score may also be valid at time points after diagnosis (or the initial presentation), such as when the patient's condition or laboratory values deteriorate; however, further studies are needed to assess its suitability in such settings. No information on patient ethnicity/race was available for the current analysis; it may be assumed that the distribution follows that of the German population and represents largely Caucasians, which may limit generalizability. External validation in different patient populations is thus needed, also with regard to socioeconomic factors and local standards of care. Extensive information on comorbid conditions was available for the study participants. Although some passed the criteria in RSF stability selection, none passed the RF stability criteria. Yet, having more predictors (24 vs. 5) did not improve the overall predictive performance. This suggests that the increased risk due to these comorbidities may already be reflected by the remaining five predictors (collinearity), obviating the need for their inclusion in the score.
However, this may not hold true for less common comorbidities, as the overall prediction improvement will be low for low-prevalence predictors, even if they have a strong effect for patients suffering from these comorbidities. Thus, a score based on the total population, as presented here, may underestimate high-risk constellations due to rare comorbidities such as specific cancers, autoimmune diseases/immunosuppressive treatments, etc. To our knowledge this limitation applies to most if not all available COVID-19 prognosis scores which were derived on the total population. Yet, such patients may deteriorate rapidly. It seems important to establish the additional risk for specific conditions on top of the presented score in future studies. Initial derivation and validation analyses were performed on the respective datasets (n=1297 and 649, respectively) as summarized in Figure 2. As indicated, the final score was additionally independently validated on the full and the limited test sets (n=675 and 218, as described in Figure 2). Indicated are the median values and the full range for the imputed datasets (in brackets). AUC values were calculated for an event within 7 days of diagnosis ("7 d") and for all time points ("all"). 95% confidence intervals (95% CI) were calculated for score predictions using bootstrapping with equal contributions of the imputed datasets. Abbreviations: AUC, area under the receiver operating characteristic (ROC) curve; imp., imputation; N pr., number of predictors in the model; pr., predictors; RF, random forest for classification.
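A percentile-bootstrap confidence interval of the kind described in the table caption above can be sketched as follows (a minimal illustration using the rank-based AUC; not the authors' implementation, which additionally balances contributions of the imputed datasets):

```python
import random

def auc(scores, events):
    """AUC as the Mann-Whitney probability that an event case
    out-scores a non-event case (ties count 1/2)."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, events, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample patients with replacement,
    recompute the AUC, and take the empirical alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        s = [scores[i] for i in sample]
        e = [events[i] for i in sample]
        if any(e) and not all(e):  # need both classes in the resample
            stats.append(auc(s, e))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Resampling whole patients (rather than individual predictions) keeps each bootstrap replicate a plausible cohort, which is what makes the resulting interval interpretable as uncertainty about cohort-level discrimination.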
Indicated are β coefficients from binomial ridge regression (outcome: event within 7 days) and the resulting weights per step increase in the respective predictor group (all groups are listed in Table 4). P-values were calculated using ridge regression on the derivation dataset with permutations of the outcome. Abbreviations: LN, laboratory normal range; "x" indicates multiples of the upper limit of the normal range.
References:
A Novel Coronavirus from Patients with Pneumonia in China
Covid-19 compared to other pandemic diseases
Back to the Future: Lessons Learned From the 1918 Influenza Pandemic
Potential association between COVID-19 mortality and health-care resource availability
Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC)
Feature selection with the Boruta package
Using random forest to learn imbalanced data
Random survival forests
Significance testing in ridge regression for genetic data
The inconsistency of 'optimal' cutpoints obtained using two criteria based on the receiver operating characteristic curve
Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal
Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients with COVID-19
Machine Learning to Predict Mortality and Critical Events in COVID-19 Positive New York City Patients: A Cohort Study (Preprint)
Prediction model and risk scores of ICU admission and mortality in
Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: Development and validation of the 4C Mortality Score
Early prediction of mortality risk among patients with severe COVID-19, using machine learning
Mortality Analyses - Johns Hopkins Coronavirus Resource Center
Prediction models for covid-19 outcomes
Lessons learnt from easing COVID-19 restrictions: an analysis of countries and regions in Asia Pacific and Europe
A validated, real-time prediction model for favorable outcomes in hospitalized COVID-19 patients
Factors associated with COVID-19-related death using OpenSAFELY
Management of acute kidney injury in patients with COVID-19
COVID-19-associated acute kidney injury: consensus report of the 25th Acute Disease Quality Initiative (ADQI) Workgroup
Dr. Spinner reports grants, personal fees and non-financial support from Gilead Sciences; grants and personal fees from Janssen-Cilag; personal fees from Formycon; other from Aperion; and other from Eli Lilly, during the conduct of the study; as well as personal fees from AbbVie; personal fees from MSD; and grants and personal fees from GSK/ViiV Healthcare, outside the submitted work. Dr. Rüthrich reports grants from IZKF, outside the submitted work. The LEOSS registry was supported by the German Center for Infection Research (DZIF) and the Willy Robert Pitzer Foundation. Abbreviations: 7d, event (critical phase or COVID-19-related death) within 7 days of diagnosis; LN, laboratory normal range, "x" indicates multiples of the upper limit of the normal range; Test, f., full test set (as shown in Fig. 2); Test, l., limited test set (as shown in Fig. 2).