key: cord-0793282-qmbqlygn authors: Calvillo-Batllés, P.; Cerdá-Alberich, L.; Fonfría-Esparcia, C.; Carreres-Ortega, A.; Muñoz-Núñez, C.F.; Trilles-Olaso, L.; Martí-Bonmatí, L. title: Development of severity and mortality prediction models for covid-19 patients at emergency department including the chest x-ray date: 2022-01-21 journal: Radiologia (Engl Ed) DOI: 10.1016/j.rxeng.2021.09.004 sha: 0a9698b454b1d089aee3bc45d31e51b6c2688c38 doc_id: 793282 cord_uid: qmbqlygn
OBJECTIVES: To develop prognosis prediction models for COVID-19 patients attending an emergency department (ED) based on initial chest X-ray (CXR), demographics, clinical and laboratory parameters.
METHODS: All symptomatic confirmed COVID-19 patients admitted to our hospital ED between February 24th and April 24th 2020 were recruited. CXR features, clinical and laboratory variables and CXR abnormality indices extracted by a convolutional neural network (CNN) diagnostic tool were considered potential predictors on this first visit. The most serious individual outcome defined three severity levels: 0) home discharge or hospitalization ≤ 3 days, 1) hospital stay >3 days and 2) intensive care requirement or death. Severity and in-hospital mortality multivariable prediction models were developed and internally validated. The Youden index was used for the optimal threshold selection of the classification model.
RESULTS: A total of 440 patients were enrolled (median 64 years; 55.9% male); 13.6% of patients were discharged, 64% hospitalized, 6.6% required intensive care and 15.7% died. The severity prediction model included oxygen saturation/inspired oxygen fraction (SatO2/FiO2), age, C-reactive protein (CRP), lymphocyte count, extent score of lung involvement on CXR (ExtScoreCXR), lactate dehydrogenase (LDH), D-dimer level and platelet count, with AUC-ROC = 0.94 and AUC-PRC = 0.88.
The mortality prediction model included age, SatO2/FiO2, CRP, LDH, CXR extent score, lymphocyte count and D-dimer level, with AUC-ROC = 0.97 and AUC-PRC = 0.78. The addition of CXR CNN-based indices did not significantly improve the predictive metrics.
CONCLUSION: The developed and internally validated severity and mortality prediction models could be useful as triage tools in ED for patients with COVID-19 or other virus infections with similar behaviour.
patterns of lung involvement [1] [2] [3] [4] [5] . However, studies on the utility of the chest X-ray (CXR) for predicting health outcomes are limited [6] [7] [8] [9] and the prognostic studies have mainly been based on chest CT [10] [11] [12] . Considering the more frequent use of CXR, its wider availability and its safer use with regard to controlling the spread of the virus when compared with CT, we aimed to develop two multivariable prediction models for severity and mortality estimation in COVID-19, taking into consideration the radiological, demographic, clinical and laboratory variables registered at the emergency evaluation.
The institutional review board approved this retrospective study. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. It followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis: the TRIPOD statement 13 Figure 1 ).
Initial CXR readings on admission were distributed among five radiologists with an average of 11 years of experience in thoracic imaging, blinded to the rest of the parameters and to the outcome. The following items were described ( Figure 2) :
• Absence (level 0) or presence and density of opacities: only low-density (level 1) or consolidation (+/- low density) (level 2). Lung opacities were considered "low-density" if the attenuation did not conceal the underlying vessels and "consolidation" if the opacification of the parenchyma obscured the underlying vessels.
• Extension degree and score of lung involvement: the extent was graded as mild (opacity occupying less than 1 field); moderate (1-2 fields involved); extensive (3-4 fields involved) and very extensive (5-6 fields involved). A numerical value was assigned to each field depending on the percentage with increased attenuation: 0 (0%), 1 (≤50%), and 2 (>50%). A total score of the lung involvement extent (ExtScoreCXR) was obtained by adding up the six field scores, giving a value from 0 to 12. The ExtScoreCXR was created by the authors after agreeing by consensus that it was a simple, fast, reproducible and optimal method for semi-quantification of the lung involvement extent. An extent score based on the division into lung fields has also been used by other authors in patients with COVID-19 [6] [7] [8] 15 .
The imaging variables were agreed upon by the radiologists from the X-rays of the first 80 cases detected. In-house repository software was used to record all the variables in a structured shared database, with descriptions and imaging reminders aiming to reduce variability between readers and mandatory fill-in fields to optimize data collection.
Demographics, institutionalization, comorbidities, clinical manifestations, peripheral oxygen saturation (SatO2) and laboratory data (C-reactive protein (CRP), lactate dehydrogenase (LDH), lymphocyte count, platelet count and D-dimer) were recorded. We calculated SatO2/FiO2 to avoid data loss from patients with SatO2 obtained under oxygen therapy. FiO2 is the fraction of inspired oxygen and changes depending on the oxygen flow rate delivered to each patient; for room air it is 0.21.
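The scoring rules above can be expressed as a short sketch. The function names (`field_score`, `ext_score_cxr`, `sat_fio2_ratio`) are hypothetical and not taken from the study's software, and the sketch assumes SatO2 is given as a percentage and FiO2 as a fraction between 0 and 1, which the text does not state explicitly.

```python
from typing import Sequence

ROOM_AIR_FIO2 = 0.21  # FiO2 for room air, as stated in the text


def field_score(percent_involved: float) -> int:
    """Map the percentage of one lung field with increased attenuation
    to the per-field value: 0 (0%), 1 (<=50%), 2 (>50%)."""
    if not 0 <= percent_involved <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_involved == 0:
        return 0
    return 1 if percent_involved <= 50 else 2


def ext_score_cxr(field_scores: Sequence[int]) -> int:
    """Add up the six per-field scores to get a total from 0 to 12."""
    if len(field_scores) != 6:
        raise ValueError("expected exactly six lung-field scores")
    if any(s not in (0, 1, 2) for s in field_scores):
        raise ValueError("each field score must be 0, 1 or 2")
    return sum(field_scores)


def sat_fio2_ratio(sat_o2_percent: float, fio2: float = ROOM_AIR_FIO2) -> float:
    """SatO2/FiO2 ratio; assumes SatO2 in percent and FiO2 as a fraction."""
    if not 0 < fio2 <= 1:
        raise ValueError("FiO2 must be a fraction in (0, 1]")
    return sat_o2_percent / fio2


# Illustrative values only, not patient data.
scores = [field_score(p) for p in (0, 10, 60, 80, 50, 0)]
total = ext_score_cxr(scores)
```

Dividing by FiO2 puts a saturation measured under oxygen therapy on the same scale as one measured on room air, which is why the ratio avoids discarding those patients.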
Probability indices of lung abnormal findings were extracted from CXR by a
Indices for "consolidation", "lung opacity" and "abnormal CXR" were incorporated into the final model to assess whether they improved its predictive accuracy.
Three severity levels were defined: home discharge or hospitalization ≤ 3 days (level 0), hospital stay >3 days (level 1), need for intensive care unit (ICU) stay or death due to COVID-19 (level 2). Both days of hospitalization and days to death were registered. The median follow-up was 91 days (range 64-124 days).
Correlations between the lung involvement extension on CXR (degrees and score) and the days with symptoms, the SatO2/FiO2 and the outcome variables were investigated. Spearman, Kendall rank or point-biserial tests were used depending on the type of the studied variables. To interpret the strength of a relationship from its r-value (taking the absolute value), we applied the following rule of thumb: r < 0.1 none, 0.1 < r < 0.3 weak, 0.3 < r < 0.5 moderate, 0.5 < r < 0.8 strong and r > 0.8 very strong.
Different prognostic predictive models were developed using three types of classifiers fitting several weak learners sequentially, where, at each iteration, more weight is given to the observations with the worst prediction from the previous iteration. Since each weak learner is built upon the results of the preceding one, training cannot be parallelized and can take longer. An internal validation was performed with an unseen dataset corresponding to the remaining 20% to assess model generalizability and robustness. The model hyperparameters were obtained with a grid search strategy, a hyperparameter optimization method that methodically builds and evaluates a model for each combination of algorithm parameters specified in a grid.
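A grid search of the kind described above can be sketched in a few lines. This is a generic, library-free illustration, not the study's code: `grid_search`, `toy_score` and the parameter names `learning_rate` and `n_estimators` are assumptions chosen for the example, and `toy_score` merely stands in for "train a model with these hyperparameters and return a validation score".

```python
from itertools import product
from typing import Any, Callable, Dict, Mapping, Sequence, Tuple


def grid_search(
    evaluate: Callable[[Dict[str, Any]], float],
    param_grid: Mapping[str, Sequence[Any]],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination in the grid and return the best one.

    `evaluate` is a hypothetical callback that fits a model with the given
    hyperparameters and returns a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product() enumerates the full Cartesian product of the candidate values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    if best_params is None:
        raise ValueError("param_grid produced no combinations")
    return best_params, best_score


def toy_score(params: Dict[str, Any]) -> float:
    # Toy stand-in for model training; peaks at learning_rate=0.1, n_estimators=200.
    return -abs(params["learning_rate"] - 0.1) - abs(params["n_estimators"] - 200) / 1000


grid = {"learning_rate": [0.01, 0.1, 0.5], "n_estimators": [100, 200, 400]}
best_params, best_score = grid_search(toy_score, grid)
```

The number of model fits equals the product of the list lengths (nine here), so the cost grows quickly with each added hyperparameter; in practice the score would normally come from cross-validation on the training portion only, leaving the held-out 20% untouched.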
In order to avoid redundant information, to prevent the models from becoming unstable in the presence of strong feature dependencies and to improve their interpretability, features with high correlation (>80%) were identified with a Spearman's rank-order correlation matrix ( Figure 3 ). From each pair of highly correlated features, the one with the larger p-value in the univariate statistical test was excluded from the models. Following this criterion, the extent degrees and the distribution of the opacities in the middle lung field were discarded.
Three models were developed with different predictive variables. The first contained the epidemiological (age, sex, institutionalization and comorbidities) and all the above-mentioned radiological features. The clinical (symptoms and SatO2/FiO2) and all the above-mentioned laboratory parameters were added to the first model to build the second model, and finally the CNN-derived data were added to the second model to build the third model.
A partial under-sampling methodology followed by a synthetic minority over-sampling technique (SMOTE) was used to address the data imbalance problem, which is very common in machine learning settings 16 . Features were standardized accordingly.
As a method for dimensionality reduction and to evaluate the impact of each feature, the variable importance was calculated. This importance measures how much removing a variable decreases accuracy and, conversely, how much including it increases accuracy. The default method to compute variable importance is the mean decrease in impurity (or Gini importance): at each split in each tree, the improvement in the split criterion is the importance attributed to the splitting variable, and it is accumulated over all the trees in the forest separately for each variable. Note that this measure is quite like the R^2 in regression on the training set.
If a variable has very little predictive power, removing it may lead to an increase in accuracy due to random noise.
Sensitivity, specificity, PPV, NPV, AUC-ROC and the area under the precision-recall curve (AUC-PRC) were obtained for each model. The Youden index was used for the optimal threshold selection of the classification model, seeking the highest sensitivity and NPV for critically ill (or dead) patients and the highest specificity and PPV for the mild severity (or alive) ones. The optimal thresholds were defined on the training data set. A weighted micro-average statistical approach was used to obtain the values per severity level after threshold optimization of the classification model with the Youden index. A macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes to compute the average metric, which is particularly useful when the classes vary in size. In order to test for a potential overestimation of the diagnostic model performance, the metrics were obtained by evaluating the same thresholds on the internal validation data set. DeLong's test for two correlated ROC curves 17 was used to compare the performance of the models.
The following Python and machine learning libraries were used to perform the data visualization and statistical analysis in the study: Pandas, NumPy, SciPy, Matplotlib and scikit-learn.
Of 445 registered patients, 5 were excluded (1 acute appendicitis, 1 cholangitis, 1 diverticulitis, 1 stroke and 1 cardiac failure). A final population of 440 was enrolled in the study. The median age was 64 years (range 17-100) and 55.9% were male. 79% had ≥1 comorbidity; the most frequent were hypertension, dyslipidaemia and diabetes (Table 1) .
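The Youden-index threshold selection mentioned in the methods can be illustrated with a minimal sketch for the binary case. The Youden index is J = sensitivity + specificity - 1, and the chosen threshold is the one that maximizes J. The function name `youden_threshold` and the scores below are illustrative assumptions, not the study's code or data.

```python
from typing import Sequence, Tuple


def youden_threshold(y_true: Sequence[int], y_score: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its score is >= threshold. On ties in J,
    the lowest threshold is kept, which favours sensitivity (an assumption).
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_threshold, best_j = None, float("-inf")
    for t in sorted(set(y_score)):  # every observed score is a candidate cut-off
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Illustrative training-set labels (1 = event) and predicted probabilities.
y_train = [0, 0, 0, 0, 1, 0, 1, 1, 1]
p_train = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.9]
threshold, j_stat = youden_threshold(y_train, p_train)
```

Following the approach described in the text, the threshold would be fixed on the training data and then applied unchanged to the validation set, so that the reported metrics are not inflated by tuning on the held-out cases.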
On arrival at the ED, the average number of days with symptoms was 6.8 (range 0-30), and the most common symptoms were, in this order, fever and cough. The average oxygen
13.6% of patients were discharged home or hospitalized ≤ 3 days; 64% of patients were hospitalized (4-54 days, average 17 days); 6.6% required intensive care (2-65 days, average 18 days in ICU) and 15.7% died (0-51 days after admission, average 10 days).
The median time between CXR and real-time reverse transcription polymerase chain reaction (RT-PCR) was 1 day (range 0-30). 65.9% of patients with a pending RT-PCR result showed lung involvement suggestive of COVID-19 on CXR, anticipating the definitive diagnosis. The ExtScoreCXR was 3.3 +/- 3.07 (average +/- SD) ( Table 3) .
Of the 76 patients initially discharged home, 24% were admitted on a second visit to the ED. The first CXR of these patients was normal in 7 and showed very slight or difficult-to-interpret opacities in 11; on the second visit all presented progression of the lung involvement.
Probability indices (average +/- standard deviation (range)) for "consolidation", "lung opacity" and "abnormal CXR" obtained from the CXR of the population studied were 0.39 +/-
The SatO2/FiO2 (33%), the CNN-based index for lung consolidation (13%), the LDH (12%), the ExtScoreCXR (9%), the age (9%), the lymphocyte count (9%), the CRP (7%), (Table 4 ). In this case, the age (43%), the SatO2/FiO2 (20%), the CRP (15%), the LDH (7%), the ExtScoreCXR (6%), the lymphocyte count (6%) and the D-dimer level (3%) were, in this order, the most heavily weighted predictors. There were no statistically significant differences in AUC-ROC between the models with and without CNN-based indices (p-value=0.315), suggesting that the addition of the AI parameters does not offer a significant improvement in model performance.
Figure 6 shows the ROC and the PRC curves of the internal validation performed with an unseen dataset for a selection of three mortality classification models built with three different combinations of features.
In this study, the presence and extent of lung involvement on the initial CXR of COVID-19 patients had prognostic value. In the univariable analysis the ExtScoreCXR showed moderate correlation with the severity level and mortality, and in the first multivariable models developed, based on age, gender and radiological features, the ExtScoreCXR was the strongest predictor of severity and the second predictor of in-hospital mortality to be included in the models. The number of days with symptoms on arrival at the ED was not related to the lung involvement extension. Other series likewise found no significant difference between severe and non-severe patients in the median days from symptom onset to hospital admission 26 . Tobacco use, comorbidities such as obesity, hypertension, diabetes, cardiovascular disease, respiratory diseases and cancer history, and the presence of fever, dyspnea, haemoptysis and unconsciousness were also associated with a worse prognosis in some publications 25, [27] [28] [29] , but not in our study. There is probably a data collection bias from the medical records; obesity in particular may be underreported. Nevertheless, the comorbidities and symptoms appear to have a lower relative predictor weight with respect to the definitive variables
As potential sources of bias, the severity level is a decision-based clinical outcome, unlike mortality. In order to reduce this bias we grouped at level 0 not only home discharge from the ED but also hospitalization ≤ 3 days. In addition, the follow-up of at least 2 months included patients who returned to the hospital. In these cases the variables collected were still those obtained at the first visit to the ED, but the event considered as the outcome was the most severe one.
In fact, the proportion of the most severe patients (22%) was within the range published in larger series (15-36%) 32, 33 . Regarding the proposed method to quantify the extent of lung involvement (ExtScoreCXR), we have not analyzed interobserver agreement. On the other hand, good interobserver agreement was demonstrated with the use of Brixia, a more complex score designed for COVID-19 patients 9 , and coincidentally we have used the same score as other authors, who have recently published its good correlation with the Brixia score 15 . The internal validation was performed with 88 cases. However, it has been reported that a minimum sample size of 100 is recommended in order to achieve a robust validation 34 . An external validation with cases from other hospitals is desirable to assess the generalizability and the potential use of the developed models in daily clinical practice.
(33%), convolutional neural network (CNN)-based index for the presence of lung consolidation on the chest X-ray (CXR) (13%), lactate dehydrogenase (LDH) (12%), extent score of lung involvement on the CXR (ExtScoreCXR) (9%), age (9%), lymphocyte count (9%), C-reactive protein (CRP) (7%), CNN-based index for the presence of lung opacities on the CXR (3%), D-dimer level (3%) and platelet count (2%); and mortality (bottom): age (54%), SatO2/FiO2 (14%), lymphocyte count (10%), LDH (8%), CNN-based index for the presence of lung consolidation on the CXR (7%), platelet count (4%) and D-dimer level (3%). Those with an importance below 0.01 were excluded from the model.
Table 4 . Metrics of the severity and mortality predictive models. Performance metrics of the internal validation performed with an unseen dataset for each of the selected severity and in-hospital mortality predictive models built with three different combinations of parameters.
The Youden index was used for the optimal
Guarantor of integrity of the entire study: L.M.B
Pilar Calvillo-Batllés, Leonor Cerdá-Alberich
8. Manuscript editing: Pilar Calvillo-Batllés
CT Imaging Features of 2019 Novel Coronavirus (2019-nCoV)
Chest CT Findings in Coronavirus Disease-19 (COVID-19): Relationship to Duration of Infection
COVID-19)
Performance of Radiologists in Differentiating COVID-19 from Non-COVID-19 Viral Pneumonia at Chest CT
Coronavirus Disease 2019 (COVID-19): A Systematic Review of Imaging Findings in 919 Patients
Clinical and Chest Radiography Features Determine Patient Outcomes in Young and Middle-aged Adults with COVID-19
Model-based Prediction of Critical Illness in Hospitalized Patients with COVID-19
Chest radiograph at admission predicts early intubation among inpatient COVID-19 patients
COVID-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression
Association of radiologic findings with mortality of patients infected with 2019 novel coronavirus in Wuhan
Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
Well-aerated Lung on Admitting Chest CT to Predict Adverse Outcome
Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement
A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies
Modified Chest X-Ray Scoring System in Evaluating Severity of COVID-19 Patient in Dr Soetomo General Hospital Surabaya, Indonesia
A survey on addressing high-class imbalance in big data
Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach
COVID-19 patients and the radiology department - advice from the European Society of Radiology (ESR) and the European Society of Thoracic Imaging (ESTI)
Value of initial chest radiographs for predicting clinical outcomes in patients with severe acute respiratory syndrome
A correlation between the severity of lung lesions on radiographs and clinical findings in patients with respiratory syndrome
Mild versus severe COVID-19: Laboratory markers
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
Risk Factors Associated With Acute Respiratory Distress Syndrome and Death in Patients With Coronavirus Disease
Risk factors of critical & mortal COVID-19 cases: A systematic literature review and meta-analysis
Clinical characteristics of 140 patients infected with SARS-CoV-2 in Wuhan
Clinical predictors of mortality due to COVID-19 based on an analysis of data of 150 patients from Wuhan, China
Obesity in patients with COVID-19: a systematic review and meta-analysis
China Medical Treatment Expert Group for COVID-19. Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19
Prognostic Accuracy of the SIRS, qSOFA, and NEWS for Early Detection of Clinical Deterioration in SARS-CoV-2 Infected Patients
National Early Warning Score 2 (NEWS2) on admission predicts severe disease and in-hospital mortality from Covid-19 - a prospective cohort study
Characteristics of and Important Lessons From the COVID-19) Outbreak in China: Summary of a Cases From the Chinese Center for Disease Control and Prevention
Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area
Sample size considerations for the external validation of a multivariable prognostic model: a resampling study
Average +/- SD (range)
Clinical and laboratory variables investigated as potential predictors. Gastrointestinal symptoms: diarrhea, vomiting or abdominal pain. SatO2/FiO2: peripheral oxygen saturation / inspiratory oxygen fraction (room air or oxygen therapy).
* Predictors in the definitive prognostic prediction models
Low-density opacity(ies) (level 1): 254 (57.7)
Consolidation/s +/- low-density opacity(ies) (level 2): 100
25 (5.7)
Dementia: 19 (4.3)
Current smoker: 16 (3.6)
Obstructive sleep apnea: 16 (3.6)
Ex-smoker: 15 (3.4)
Hypothyroidism: 13 (2.9)
Atrial fibrillation: 13 (2.9)
Chronic obstructive pulmonary disease: 9 (2)
Extensive: 101 (22.9)
Very extensive: 76 (17.3)
Score (ExtScoreCXR): 3.3 +/- 3.07 (0-12)
Table 3 . Lung involvement on CXR, distribution and extension. CXR features investigated as potential predictors. ExtScoreCXR: Extent score of lung involvement on CXR. * Predictor in the final prognostic prediction models.