title: Predicting Prolonged Hospitalization and Supplemental Oxygenation in Patients with COVID-19 Infection from Ambulatory Chest Radiographs using Deep Learning
authors: Pyrros, Ayis; Flanders, Adam Eugene; Rodríguez-Fernández, Jorge Mario; Chen, Andrew; Cole, Patrick; Wenzke, Daniel; Hart, Eric; Harford, Samuel; Horowitz, Jeanne; Nikolaidis, Paul; Muzaffarg, Nadir; Boddipalli, Viveka; Nebhrajani, Jai; Siddiqui, Nasir; Willis, Melinda; Darabi, Houshang; Koyejo, Sanmi; Galanter, William
date: 2021-05-21
journal: Acad Radiol
DOI: 10.1016/j.acra.2021.05.002

RATIONALE AND OBJECTIVE: The clinical prognosis of outpatients with coronavirus disease 2019 (COVID-19) remains difficult to predict, with outcomes ranging from asymptomatic infection to hospitalization, intubation, and death. Here we determined the prognostic value of an outpatient chest radiograph, together with an ensemble of deep learning algorithms predicting comorbidities and airspace disease, to identify patients at higher risk of hospitalization from COVID-19 infection.

METHODS: This retrospective study included outpatients with COVID-19 confirmed by reverse transcription-polymerase chain reaction testing who received an ambulatory chest radiograph between 3/17/2020 and 10/24/2020. Full admission was defined as hospitalization within 14 days of the COVID-19 test for >2 days with supplemental oxygen. Univariate analysis and machine learning algorithms were used to evaluate the relationship between the deep learning model predictions and hospitalization for >2 days.

RESULTS: The study included 413 patients, 222 men (54%), with a median age of 51 years (interquartile range, 39–62 years). Fifty-one patients (12.3%) required full admission. A boosted decision tree model produced the best prediction. Variables included patient age, frontal chest radiograph predictions of morbid obesity, congestive heart failure, and cardiac arrhythmias, and radiographic opacity, with an internally validated area under the curve (AUC) of 0.837 (95% CI: 0.791–0.883) on a test cohort.

CONCLUSIONS: Deep learning analysis of single frontal chest radiographs was used to generate combined comorbidity and pneumonia scores that predict the need for supplemental oxygen and hospitalization for >2 days in patients with COVID-19 infection, with an AUC of 0.837 (95% confidence interval: 0.791–0.883). Comorbidity scoring may prove useful in other clinical scenarios.

KEY POINTS:
• Supervised multi-task deep learning with convolutional neural networks (CNNs) on frontal chest radiographs was able to predict many underlying patient comorbidities represented by hierarchical condition categories (HCCs) from the International Classification of Diseases, Tenth Revision (ICD10), including those corresponding to diabetes with chronic complications, morbid obesity, congestive heart failure, cardiac arrhythmias, and chronic obstructive pulmonary disease. Using submitted HCC codes to train and test the CNNs, the overall area under the receiver operating characteristic (ROC) curve (AUC) across the predicted comorbidities was 0.856 (95% CI: 0.845-0.862), with individual AUCs ranging between 0.729 and 0.927.
• Combining the multi-task CNN output with patient age and two standardized COVID-19 airspace disease predictors in 413 outpatients testing positive for COVID-19, a standard frontal chest radiograph predicted hospitalization of >2 days' duration and supplemental oxygenation with an ROC AUC of 0.837 (95% CI: 0.791–0.883), independent of additional clinical and laboratory data.

Summary Statement: An ensemble deep learning model, predicting select comorbidities and geographic-opacity scores from an ambulatory frontal chest radiograph, combined with the patient's age, was able to predict prolonged hospitalization and supplemental oxygenation for ambulatory COVID-19 patients, with an ROC AUC of 0.837 (95% CI: 0.791–0.883).

The coronavirus disease 2019 (COVID-19) pandemic placed unprecedented demand on healthcare systems. Although many infected individuals have mild or no symptoms, some become very ill and may be hospitalized for long durations [1]. Comorbid conditions like diabetes and cardiovascular disease are associated with more severe cases of COVID-19 [2]. Unfortunately, relevant comorbidities are sometimes unknown or unrecognized by the medical provider and patient, limiting the provider's ability to perform a proper risk assessment [3]. Currently, the extraction of comorbidity data is based on contemporaneously provided patient history, manual record review, and/or electronic health record (EHR) queries [4], and the results are imperfect and often incomplete. The purpose of this study was to develop a deep learning algorithm that could predict the likely presence of relevant comorbidities, in combination with an algorithm to quantify opacity, from frontal chest radiographs (CXRs), and thereby enable providers to more effectively risk-stratify patients presenting with COVID-19 infection.

COVID-19 infection is diagnosed with reverse transcription-polymerase chain reaction (RT-PCR) or antigen tests. In patients with limited symptoms, additional testing is often unnecessary. In patients at higher risk for severe disease or complications, however, including those presenting with more severe symptoms, chest radiography is widely used for evaluation [5].

The Centers for Medicare and Medicaid Services uses a specific subset of hierarchical condition category (HCC) codes from the International Classification of Diseases, Tenth Revision (ICD10) to model chronic disease comorbidities and their associated costs of care for value-based payment models [6]. The codes are generated through encounters with healthcare providers and recorded in administrative (billing) data. These data elements are often more reproducible and more amenable to query than broader searches of EHR systems. Using a convolutional neural network (CNN) to link these categorical codes to a CXR can convert the images into useful biomarker proxies for a patient's chronic disease burden. For instance, a high categorical prediction for HCC18 would indicate that a CXR strongly suggests diabetes with chronic complications.

Multiple predictive clinical models of the course of COVID-19 infection use demographic information, clinically obtained comorbidity data, laboratory markers, and radiography [7, 8]. Radiography is used as a proxy for infection severity by quantifying the geographic extent and degree of lung opacity [7, 8]. However, we are not aware of previous models using radiographs to directly predict or quantify comorbidities that contribute to patient outcomes.
We hypothesize that an ensemble CNN model derived from frontal CXRs, composed of comorbidity and geographic-opacity scores, can predict prolonged hospitalization and supplemental oxygenation of ambulatory COVID-19 patients.

This study was approved by the institutional review board and was granted waivers of Health Insurance Portability and Accountability Act authorization and written informed consent. The two cohorts in this study comprise patients receiving an outpatient frontal CXR at __________, a large multi-specialty group in the suburbs of ______. The first cohort consists of 14,121 CXRs of patients from 2010 to 2019 who were enrolled in the Medicare Advantage program. These patients had CXRs for typical clinical indications, like pneumonia, chest pain, and cough. None of these patients had COVID-19 infection; the relatively small number of available COVID-19 patients would have made it difficult to obtain an adequate distribution of comorbidities to train a neural network. This cohort was used to develop and validate a multi-task CNN to predict HCC-based comorbidities.

The second cohort was seen between 3/17/2020 and 10/24/2020 and received both a CXR and a positive RT-PCR COVID-19 test in the ambulatory or immediate care setting. Some of these patients went to the emergency department after the positive RT-PCR test, and some were hospitalized. The EHR clinical notes were reviewed for information regarding the reason for and date of admission, treatments, and length of hospitalization in days. This cohort is called COVID+. We define "full admission" as hospitalization for >2 days with supplemental oxygen. In cases of multiple positive COVID-19 RT-PCR tests, or negative and then positive tests, the first positive test was used as the reference date. Likewise, in patients with multiple outpatient CXRs, the radiograph closest to the initial positive RT-PCR was used, with only one radiograph used per subject. Patients without locally available or recent CXRs, radiographs obtained >14 days after positive RT-PCR testing, and subjects <16 years old at the time of radiography were excluded. Patients admitted for >2 days within 14 days of the RT-PCR test and 7 days of chest radiography were defined as full admissions (Figure 1). All radiographs were obtained conventionally with digital posteroanterior radiography; no portable radiographs were included. All CXRs were extracted from the PACS system using a scripted method (SikuliX, 2.0.2) and saved as de-identified 8-bit grayscale portable network graphics (PNG) files.

A multi-task CNN was trained on anonymized outpatient frontal CXRs from 2010 to 2019, randomly split into 80% training and 20% test data sets. The gold standard was the EHR ICD10 codes. Using the 80% set, the CNN was trained to predict gender, age, and six common ICD10 HCC codes (model v23, used by the Centers for Medicare and Medicaid Services). ICD10 codes were obtained via queries of the EHR (Epic) from the transactions table. The following HCC categories were used: diabetes with chronic complications (HCC18), morbid obesity (HCC22), congestive heart failure (CHF; HCC85), specified heart arrhythmias (HCC96), vascular disease (HCC108), and chronic obstructive pulmonary disease (COPD; HCC111). In the training set, each radiographic file was a separate row, with the absence of any associated ICD10 HCC codes labeled as 0 and the presence of one or more codes labeled as 1.
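As an illustration of this labeling scheme, the following is a minimal sketch, assuming a simple tabular export in which each radiograph row carries its billed HCC codes; the column names, file names, and the pandas-based approach are illustrative assumptions rather than the authors' code.

```python
import pandas as pd

# The six HCC tasks named in the text above.
HCC_TASKS = ["HCC18", "HCC22", "HCC85", "HCC96", "HCC108", "HCC111"]

def build_labels(df: pd.DataFrame) -> pd.DataFrame:
    """One row per radiograph: each HCC column is 1 if any matching
    ICD10 HCC code was billed for the patient, else 0."""
    for hcc in HCC_TASKS:
        df[hcc] = df["hcc_codes"].apply(lambda codes: int(hcc in codes))
    return df[["png_path", "age", "gender"] + HCC_TASKS]

# Toy record for illustration; real rows would come from the EHR query.
records = pd.DataFrame({
    "png_path": ["cxr_0001.png"],        # hypothetical de-identified PNG file name
    "age": [66],
    "gender": [1],
    "hcc_codes": [{"HCC22", "HCC85"}],   # billed HCC codes for this patient
})
print(build_labels(records))
```

Each HCC column then serves as one binary target for the multi-task network described below.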
Binary cross-entropy was used as the objective function in PyTorch (version 1.01; pytorch.org), and the Adam optimizer [9] with a learning rate of 0.0005 was used to train the neural network. The learning rate was decreased by a factor of 10 when the loss ceased to decrease for 10 iterations. Random horizontal flips (20%), random affine transformations (rotation by a maximum of 10 degrees), random resized crops (scale range 1.0, 1.1; ratio range 0.75, 1.33), and random perspective distortion (distortion scale of 0.2, with a probability of 0.75) were applied for data augmentation. Image normalization was performed using the standard PyTorch function, with the mean and standard deviation of the pixel values computed over the training set. For image resizing, we used the PIL library to downscale to 256 × 256 with the Lanczos filter. A customized CoordConv [10] ResNet34 model was pretrained on the CheXpert dataset [11]; CoordConv gives the convolution layer access to its own input coordinates through an extra coordinate channel. Training was performed on a Linux workstation (Ubuntu 18.04; Canonical, London, England) with two Nvidia TITAN GPUs (Nvidia Corporation, Santa Clara, Calif) and CUDA 11.0 (Nvidia) for 50 epochs over 10.38 hours, using an image size of 256 × 256 and a batch size of 64. All programs were run in Python (Python 3.6; Python Software Foundation, Wilmington, Del). (A minimal code sketch of this training configuration appears below.)

Multi-task learning (MTL) is a general framework for learning several tasks simultaneously using their shared structure [12]. In contrast to standard (single-task) learning, where each task is learned independently, MTL exploits intertask relationships to improve the representation and prediction quality. MTL can be implemented using various approaches, including explicit parameter sharing and implicit parameter sharing (e.g., using nuclear norm regularization) [13]. When individual task performance improves, this is known as "positive transfer" and indicates that joint learning is superior to separate learning. In contrast, though less common, individual task performance can degrade with MTL, a "negative transfer" phenomenon [14]. MTL is a well-established approach to machine learning that has been applied in computer vision, natural language processing, and medical applications [14], among others [15].

To quantify the geographic extent and severity of opacity of infection, we used the open-source program COVID-Net [16], which produces two scores: one for the geographic extent and one for the severity of opacity. Both scores were normalized from 0 to 1.

Clinical variables evaluated included patient age, gender, and length of stay. History of COPD, diabetes, morbid obesity (body mass index [BMI] > 40), CHF, cardiac arrhythmias, or vascular disease was determined by ICD10 ambulatory billing codes. Other than the COVID-19 RT-PCR test, outpatient laboratory results were not used, as most were unavailable within 24 hours of the positive COVID-19 RT-PCR test.

One-to-one comparisons of categorical and continuous variables were done with logistic regression. The t-test was not used because many of the variables were not normally distributed. The predictions for the six HCCs were compared to the COVID+ cohort billing claims to test whether this model, derived from a large cohort of patients prior to the COVID-19 pandemic, had predictive power in the COVID+ cohort. The analysis used the area under the curve (AUC) of receiver operating characteristic (ROC) curves. No hypothesis testing was done. Logistic regression produced odds ratios with 95% confidence intervals (CIs).
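The following is a minimal PyTorch sketch of the augmentation and optimization settings reported above, under stated assumptions: a plain torchvision ResNet34 stands in for the customized CoordConv backbone, CheXpert pretraining is omitted, and the normalization statistics are placeholders for the values computed over the training set.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Placeholder normalization statistics; the study computed these over its training set.
TRAIN_MEAN, TRAIN_STD = [0.5], [0.25]

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.2),
    transforms.RandomAffine(degrees=10),
    transforms.RandomResizedCrop(256, scale=(1.0, 1.1), ratio=(0.75, 1.33)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.75),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
])

# Plain ResNet34 with a 1-channel input and six HCC logits; the paper's
# CoordConv layers and CheXpert pretraining are not reproduced here.
model = models.resnet34(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 6)

criterion = nn.BCEWithLogitsLoss()                       # binary cross-entropy on the logits
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(  # drop LR by 10x after 10 stagnant checks
    optimizer, mode="min", factor=0.1, patience=10)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; images are (B, 1, 256, 256), labels are (B, 6) floats in {0, 1}."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# After each validation pass, call scheduler.step(validation_loss) to apply the plateau rule.
```

The shared backbone with a single six-way sigmoid head is one common way to realize the explicit parameter sharing that the MTL discussion above refers to.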
All tests were two-sided, a P value <0.05 was deemed statistically significant, and analysis was conducted in R version 4 (R Foundation for Statistical Computing, Vienna, Austria).

The COVID+ cohort had 11 input features and 1 outcome: whether or not a patient had a full admission (>2 days) within 7 days of the CXR. To assess the contributions and performance of the six-variable HCC CNN model and the two-variable geographic extent and opacity severity model, separate logistic regressions were fit and their ROC AUCs compared, with k-fold cross-validation accuracies also calculated. The data were split into a training/validation set (70%) and a testing set (30%). Several machine learning models, including logistic regression, decision trees, random forest, XGBoost, LightGBM, and CatBoost, were developed and optimized in Python (Python 3.6) using the training/validation set [17-22]. The model development process used recursive feature elimination to find the optimal feature set for each model: a single feature is removed at each step and the model is evaluated on the validation set. The quality of the fits was measured using the ROC AUC. The best model was then tested against the testing set for a final measure of prediction.

In total, 413 patients were included in the COVID+ cohort (Figure 1 and Table 1). A set of 14,121 anonymized unique frontal CXRs (compliant with the Health Insurance Portability and Accountability Act) was used to train a CNN to predict six HCCs using ambulatory billing data. The CNN was also trained to predict the age of the patient. The mean age of the patients at the time of the radiograph was 66 ± 13 years, and 57% of the patients were women. First, a training set of 11,257 (80%) radiographs was used to develop the CNN, which was then tested against a randomly selected set of 2,864 (20%) radiographs. The CNN produces a prediction for each variable, including age and the probabilities of diabetes with chronic complications (HCC18), morbid obesity (HCC22), CHF (HCC85), specified heart arrhythmias (HCC96), vascular disease (HCC108), and COPD (HCC111). These outputs could be compared to the HCC data for the test cohort. For each variable, the relationship is summarized by an ROC curve, and results are shown in Table 2. Because the CNN was trained on a cohort selected from all ambulatory frontal CXRs prior to 2020, we compared the HCC predictions on the COVID+ cohort to determine whether the CNN was predictive in this clinical setting (Table 2).

A representative frontal CXR from a COVID-19 patient is shown in Figure 2, which demonstrates how the CNN analyzes the radiographs and generates the likelihoods of comorbidities. All saliency maps were generated in Python with the integrated gradients attribution algorithm, which computes the integral of the gradients of the output prediction for the class index with respect to the input image pixels [23]. Importantly, this technique does not modify the CNN model.

We used the COVID-Net deep learning model [16] to quantify the extent and degree of opacity, which generates geographic and opacity (geographic-opacity) scores, normalized from 0 to 1. For full admission, the average geographic score was 0.26 ± 0.01 (median = 0.22) and the average opacity score was 0.41 ± 0.12 (median = 0.39), while for those without full admission, the scores were 0.21 ± 0.05 (median = 0.19) and 0.34 ± 0.08 (median = 0.31), respectively, as shown in Table 1.
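As a hedged illustration of the saliency-map step described above, the sketch below uses the open-source Captum library's integrated-gradients implementation; the choice of Captum, the all-black baseline image, and the 50-step approximation are assumptions for illustration, since the text specifies only the algorithm and Python.

```python
import torch
from captum.attr import IntegratedGradients

def hcc_saliency(model: torch.nn.Module, image: torch.Tensor,
                 task_index: int, steps: int = 50) -> torch.Tensor:
    """Integrate output gradients w.r.t. input pixels for one HCC head.

    image: a (1, 1, 256, 256) normalized tensor; task_index selects the
    HCC output (e.g., the morbid-obesity logit). The trained CNN itself
    is not modified, matching the description above.
    """
    model.eval()
    ig = IntegratedGradients(model)
    baseline = torch.zeros_like(image)   # all-black reference image (an assumed baseline)
    attributions = ig.attribute(image, baselines=baseline,
                                target=task_index, n_steps=steps)
    return attributions.squeeze().detach()
```

The resulting attribution map can then be overlaid on the radiograph to produce displays like those in Figure 2.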
The patient's age, the comorbidities predicted by the CNN, and airspace disease (geographic-opacity scores) measured by the COVID-Net deep learning model [16] were used to model the likelihood of a full admission. A development cohort (N = 216) and a validation cohort (N = 73) were first used to produce models using different methods, and these were tested against a 30% test cohort (models outlined in Methods). The two best methods for prediction on the development/validation cohort were logistic regression [17] (AUC 0.81) and XGBoost [20] (AUC 0.94). The XGBoost model was then used to model the remaining independent 30% test cohort. The final fit had an AUC of 0.837 (95% CI: 0.791-0.883). This model required five variables for prediction: age, opacity, and the CNN-derived HCCs for morbid obesity, CHF, and specified heart arrhythmias. (A hedged code sketch of this model-development process appears below.)

In this preliminary study we developed an ensemble deep learning model to predict supplemental oxygenation and hospitalization of >2 days in outpatients testing positive for COVID-19. This model was based only on patient age and a conventional outpatient frontal CXR obtained before admission in 413 patients and showed an AUC of 0.837 (95% CI: 0.791-0.883) with a boosted decision tree (XGBoost) method. There is a complementary benefit of the ensemble deep learning models when predicting comorbidities and predicting geographic extent and severity of opacity on CXRs, as demonstrated by comparison of the AUCs.

There are numerous clinical models of outcomes in COVID-19, many focused on admitted and critically ill hospitalized patients [24]. Several models have utilized the CXR as a predictor of mortality and morbidity for hospitalized COVID-19 patients, based on the severity, distribution, and extent of lung opacity present [24]. ICD10 administrative data have similarly been used to effectively predict mortality in COVID-19 patients [25]. To our knowledge, no published studies have utilized features of the CXR other than those related to airspace disease to make a prognostic prediction in COVID-19.

This deep learning technique adds value when assessing patients with an unknown medical history or those awaiting laboratory testing. A significant number of COVID-19 patients demonstrate little to no abnormal lung opacity on initial radiographic imaging [26], and comorbidity scoring is beneficial in these patients when infection might be in its early stages. Additionally, comorbidity scoring could help identify patients who could benefit from earlier initiation of treatment, such as antibody therapy, or from close clinical surveillance.

Even before advanced deep learning techniques, CXR findings had been shown to correlate with the risk of stroke, vascular resistance, and atherosclerosis through the identification of aortic calcification [27-29]. Similar deep learning methods were applied to two large sets of frontal CXRs and demonstrated predictive power for mortality [29]. In another study [30], deep learning was used on a large public data set to train a model to predict age from a frontal CXR. These studies all suggest that the CXR can serve as a complex biomarker. Our deep learning techniques allowed us to make predictions regarding the probabilities of comorbidities such as morbid obesity, diabetes, CHF, arrhythmias, vascular disease, and COPD.
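To make the model-development and final-evaluation steps reported above concrete, the following is a minimal sketch under stated assumptions: the feature names are illustrative stand-ins for the model inputs described in the Methods, the XGBoost hyperparameters are untuned defaults, and the data are synthetic; this is not the study's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Illustrative feature names standing in for the model inputs described above.
FEATURES = ["age", "gender", "geo_extent", "opacity",
            "hcc18", "hcc22", "hcc85", "hcc96", "hcc108", "hcc111"]

def develop_and_test(X: pd.DataFrame, y: np.ndarray, n_keep: int = 5, seed: int = 0):
    """70/30 split, recursive feature elimination, then one held-out ROC AUC."""
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    booster = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
    selector = RFE(booster, n_features_to_select=n_keep, step=1).fit(X_dev, y_dev)
    kept = [f for f, keep in zip(FEATURES, selector.support_) if keep]
    model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
    model.fit(X_dev[kept], y_dev)
    auc = roc_auc_score(y_test, model.predict_proba(X_test[kept])[:, 1])
    return kept, auc

# Synthetic stand-in shaped like the cohort: 413 patients, roughly 12% full admissions.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(413, len(FEATURES))), columns=FEATURES)
y = rng.binomial(1, 0.12, size=413)
print(develop_and_test(X, y))
```

Evaluating the selected model only once on the held-out 30% split mirrors the study's separation of development/validation from final testing.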
Although these results do not replace traditional diagnostic methods (i.e., glycated hemoglobin, BMI), we did find that they were predictive against the gold standard of ICD10 HCC administrative codes, with all the AUC confidence intervals well above 0.5, demonstrating predictive value. When tested on an entirely different cohort, the COVID+ patients, the predictions similarly demonstrated AUCs well above 0.5, even in those with low opacity scores. Lastly, when using binary classification logistic regression on the combination of CNN models in the COVID+ cohort, we see a "lift" in the combined model AUC, with a statistically significant difference (P = 0.001).

These HCC predictions were combined with quantitative predictions of lung opacity and patient age, and the resulting model had discriminatory ability in predicting which ambulatory patients would require full admission within 14 days of the positive RT-PCR test. Interestingly, morbid obesity was not significant in the univariate analysis but was significant in the multivariate analysis, suggesting Simpson's paradox, in which a correlation changes when variables are combined [31]. The final model used three of the HCC predictions to help classify patients; thus, even if the predictions do not map perfectly onto a comorbidity, the resulting measurements correlate strongly with, and are significant predictors of, hospitalization due to COVID-19.

In this study of ambulatory patients, only a minority had timely and complete laboratory assessments, unlike hospitalized patients, who typically undergo extensive testing at presentation; this limited our ability to use laboratory results. As Schalekamp et al. described, many of these laboratory markers are not widely available or are expensive [8].

The use of comorbidity indices derived from frontal CXRs has many potential benefits. In many acute clinical settings, comorbidities may be undocumented or unknown at the time of presentation [3]. The amount of time needed to take a full history can sometimes be an impediment, especially with the high patient volumes seen during the pandemic. Since a CXR is a frequent part of the initial assessment of a COVID-19 patient, the predicted comorbidity scores could be rapidly available for all patients. Additionally, the EHR provides a predominantly binary system of documenting disease (present or absent), but not all patients have the same burden of disease, as in diabetes for instance. It is possible that a model like this one could help distinguish these differences.

Many of the models used for the prognosis of COVID-19-positive patients examine patients already hospitalized and attempt to predict clinical deterioration, intensive care unit admission, death, or some combination of these. When models are internally tested on their own data, the reported AUCs in the literature can exceed 0.9 [32], but such studies often do not use an independent test set, which increases the risk of overfitting and an artificially high AUC. Numerous models have AUCs lower than 0.8 [33], with most models ranging between 0.8 and 0.9 [25]. These models are often dependent on EHR data to make their predictions, but missing data in the EHR can adversely affect model predictions, meaning a model built from a single source such as the CXR can add significant value [34].
Although we cannot directly compare our study to these others because of differences in patient populations and study design, we believe our predictive power is comparable to that of other prognostic studies, and our model uses only the patient's age and the information in the frontal CXR.

Our study was limited to our integrated healthcare system and its geographical distribution and has only been internally validated at this time. In many cases, we had limited access to patients' complete hospitalization records and laboratory assessments, restricting endpoint analysis. In the early parts of the pandemic, many patients underwent computed tomography chest imaging in lieu of CXR because of the limited availability of RT-PCR testing. Additionally, many patients had imaging at other locations, which was not available in this study. Artificial intelligence models typically demonstrate poorer performance when used in other settings, owing to different patient demographics and equipment. No portable radiographs were used in the training or testing of our model, which could limit its use in emergency departments and hospitals. Lastly, implementation of artificial intelligence models remains a technical challenge at most institutions and practices, with relatively few available platforms and limited widespread adoption.

We found that an MTL deep learning model of comorbidities and of geographic extent and severity of opacity was predictive of prolonged hospitalization and supplemental oxygenation based on a single outpatient frontal CXR. This result suggests that further validation and extension of this particular methodology is warranted.

Figure 2 caption (partial): Saliency maps representing higher scores from the multi-task comorbidity HCC model: morbid obesity (HCC22; B), CHF (HCC85; C), and cardiac arrhythmias (HCC96; D). Much of the activation seen is outside the lung parenchyma, with notable activation of the axillary soft tissue for obesity (B) and of the heart for CHF and cardiac arrhythmias (C, D). The activations for CHF and cardiac arrhythmias are very similar but demonstrate subtle differences, with slightly greater activation at the left atrium and aortic knob (D), likely reflecting the associations of vascular disease and atrial fibrillation. BMI = body mass index, CHF = congestive heart failure, COVID-19 = coronavirus disease 2019, HCC = hierarchical condition category.

References
1. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet.
2. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area.
3. Prevalence of comorbidities and its effects in patients infected with SARS-CoV-2: a systematic review and meta-analysis.
4. Pocket change: a simple educational intervention increases hospitalist documentation of comorbidities and improves hospital quality performance measures. Qual Manag Health Care.
5. The role of initial chest X-ray in triaging patients with suspected COVID-19 during the pandemic.
6. A review on methods of risk adjustment and their use in integrated healthcare systems.
7. Clinical and chest radiography features determine patient outcomes in young and middle-aged adults with COVID-19.
8. Model-based prediction of critical illness in hospitalized patients with COVID-19.
9. Adam: a method for stochastic optimization.
10. An intriguing failing of convolutional neural networks and the CoordConv solution.
11. A large chest radiograph dataset with uncertainty labels and expert comparison.
12. Multitask learning: a knowledge-based source of inductive bias.
13. Convex multi-task feature learning.
14. Multitask learning and benchmarking with clinical time series data. Scientific Data.
15. A unified architecture for natural language processing: deep neural networks with multitask learning.
16. A tailored deep convolutional neural network design for detection of COVID-19 cases from chest x-ray images.
17. Applied Logistic Regression.
18. A survey of decision tree classifier methodology.
19. Classification and regression by randomForest. R News.
20. XGBoost: a scalable tree boosting system.
21. LightGBM: a highly efficient gradient boosting decision tree.
22. CatBoost: unbiased boosting with categorical features.
23. Axiomatic attribution for deep networks.
24. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal.
25. Development and validation of a 30-day mortality index based on pre-existing medical administrative data from COVID-19 patients: the Veterans Health Administration COVID-19 (VACO) Index. PLoS One.
26. Time course of lung changes at chest CT during recovery from coronavirus disease 2019 (COVID-19).
27. Relationship between aortic arch calcification, detected by chest x-ray, and renal resistive index in patients with hypertension.
28. Association of aortic knob calcification with intracranial stenosis in ischemic stroke patients.
29. Deep learning to assess long-term mortality from chest radiographs.
30. Age prediction using a large chest x-ray dataset.
31. The Simpson's paradox unraveled.
32. Development and validation of prognosis model of mortality risk in patients with COVID-19.
33. Clinical characteristics, associated factors, and predicting COVID-19 mortality risk: a retrospective study in Wuhan, China.
34. The value of missing information in severity of illness score development.