authors: Li, Matthew D.; Arun, Nishanth Thumbavanam; Gidwani, Mishka; Chang, Ken; Deng, Francis; Little, Brent P.; Mendoza, Dexter P.; Lang, Min; Lee, Susanna I.; O'Shea, Aileen; Parakh, Anushri; Singh, Praveer; Kalpathy-Cramer, Jayashree
title: Automated Assessment and Tracking of COVID-19 Pulmonary Disease Severity on Chest Radiographs using Convolutional Siamese Neural Networks
date: 2020-07-22
journal: Radiol Artif Intell
DOI: 10.1148/ryai.2020200079

PURPOSE: To develop an automated measure of COVID-19 pulmonary disease severity on chest radiographs (CXRs) for longitudinal disease tracking and outcome prediction.

MATERIALS AND METHODS: A convolutional Siamese neural network-based algorithm was trained to output a measure of pulmonary disease severity on CXRs (pulmonary x-ray severity (PXS) score), using weakly supervised pretraining on ∼160,000 anterior-posterior images from CheXpert and transfer learning on 314 frontal CXRs from COVID-19 patients. The algorithm was evaluated on internal and external test sets from different hospitals (154 and 113 CXRs, respectively). PXS scores were correlated with radiographic severity scores independently assigned by two thoracic radiologists and one in-training radiologist (Pearson r). For 92 internal test set patients with follow-up CXRs, PXS score change was compared to radiologist assessments of change (Spearman ρ). The association between PXS score and subsequent intubation or death was assessed. Bootstrap 95% confidence intervals (CI) were calculated.

RESULTS: PXS scores correlated with radiographic pulmonary disease severity scores assigned to CXRs in the internal and external test sets (r=0.86 (95% CI 0.80-0.90) and r=0.86 (95% CI 0.79-0.90), respectively). The direction of change in PXS score in follow-up CXRs agreed with radiologist assessment (ρ=0.74 (95% CI 0.63-0.81)). In patients not intubated on the admission CXR, the PXS score predicted subsequent intubation or death within three days of hospital admission (area under the receiver operating characteristic curve=0.80 (95% CI 0.75-0.85)).

CONCLUSION: A Siamese neural network-based severity score automatically measures radiographic COVID-19 pulmonary disease severity, which can be used to track disease change and predict subsequent intubation or death.

The role of diagnostic chest imaging continues to evolve during the COVID-19 pandemic. According to American College of Radiology guidelines, while chest CT is not recommended for COVID-19 diagnosis or screening, portable chest radiographs (CXRs) are suggested when medically necessary (1). The Fleischner Society has stated that CXRs can be useful for assessing COVID-19 disease progression (2), and one study found that 69% of patients with COVID-19 have an abnormal baseline CXR (3). While radiographic findings are neither sensitive nor specific for COVID-19, with findings overlapping other infections and pulmonary edema, CXRs can be useful for assessing pulmonary infection severity and evaluating longitudinal changes. However, there is substantial variability in the interpretation of CXRs by radiologists, as has been demonstrated for pneumonia (4-6). In addition, commonly used disease severity categories on chest radiographs, such as "mild," "moderate," and "severe," are difficult to apply reproducibly because the thresholds between these categories are subjective.
One possible solution to these challenges is to train a convolutional Siamese neural network to estimate radiographic disease severity on a continuous spectrum (7). Siamese neural networks take two separate images as inputs, which are passed through twinned neural networks (8, 9). The Euclidean distance between the final layer outputs of the two networks can then be calculated, which serves as a measure of the difference between the two images with respect to the imaging features being trained on, such as disease features. If an image-of-interest is compared pairwise to a pool of "normal" images, the disease severity can be abstracted as the median of those Euclidean distances. In this study, we hypothesized that a convolutional Siamese neural network-based algorithm could be trained to yield a measure of radiographic pulmonary disease severity on frontal CXRs (pulmonary x-ray severity (PXS) score). We evaluated the algorithm performance on internal and external test sets of CXRs from patients with COVID-19. We also investigated the association between the admission PXS score and subsequent intubation or death.

This Health Insurance Portability and Accountability Act-compliant retrospective study was reviewed and exempted by the Institutional Review Board of Massachusetts General Hospital (Boston, MA), with waiver of informed consent. To train our model, we used a publicly available CXR data set, CheXpert, from Stanford Hospital, Palo Alto (10), for pretraining, and a CXR data set from COVID-19 positive patients for subsequent training (Figure 1A). Additional COVID-19 CXR datasets were assembled for model testing and analysis of longitudinal change. CheXpert contains 224,316 CXRs with annotations for image view, which we used to filter for anterior-posterior (AP) radiographs only, as suspected or confirmed COVID-19 positive patients tend to be imaged more frequently in the AP projection in emergency rooms and hospitals. CheXpert also includes a partition for training and validation; after filtering for only AP images, the training and validation sets used for pre-training contained 161,590 and 169 images, respectively. For each image in this dataset, there are multiple radiology report-derived annotations that represent pulmonary parenchymal findings, including "lung opacity," "lung lesion," "consolidation," "pneumonia," "atelectasis," and "edema." For the purpose of creating a binary label for model pre-training, we considered any image with at least one of these annotations (labeled positive or uncertain) to have an abnormal lung label. All other images were considered to have normal lungs (irrespective of lines and tubes, cardiomegaly, and other findings). 81% of training images had abnormal lung labels (Supplemental Table 1).

The DICOM data for follow-up radiographs of internal test set patients were also obtained for longitudinal analysis. For DICOMs containing more than one frontal image acquisition, the standard frontal CXR image without postprocessing was selected, with the best positioning available (selected by M.D.L., postgraduate year 4 in-training radiologist). Most of these studies were in AP projection, as extracted from the DICOM metadata (Supplemental Table 2). Intubation and mortality data were collected from the medical record by two investigators blinded to CXR findings (A.O. and A.P., radiologists in fellowship training).
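To make the pairwise comparison concrete, the following is a minimal PyTorch sketch of a Siamese network whose output is the Euclidean distance between the embeddings of two input radiographs. It assumes the DenseNet121 backbone specified later in the Methods; the use of the pooled backbone features as the embedding is an illustrative assumption, not the exact published implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SiameseNet(nn.Module):
    """Twin encoders with shared weights; the forward pass returns the
    Euclidean distance between the two image embeddings."""
    def __init__(self):
        super().__init__()
        # Backbone choice follows the Methods (DenseNet121, ImageNet-pretrained);
        # replacing the classifier with Identity exposes the pooled 1024-d features.
        backbone = models.densenet121(weights="DEFAULT")
        backbone.classifier = nn.Identity()
        self.encoder = backbone

    def forward(self, x1, x2):
        g1 = self.encoder(x1)                    # G_W(X1)
        g2 = self.encoder(x2)                    # G_W(X2)
        return torch.norm(g1 - g2, p=2, dim=1)   # D_W = ||G_W(X1) - G_W(X2)||_2
```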
We also obtained raw DICOM data for 113 consecutive admission CXRs associated with unique patients hospitalized at least in part on April 15, 2020 at a community hospital in the United States (Newton-Wellesley Hospital [Newton, MA]), from COVID-19 positive patients (confirmed by nasopharyngeal swab RT-PCR), which served as an external test set. To provide a reference standard assessment of disease severity on CXRs, we used a simplified version of the Radiographic Assessment of Lung Edema (RALE) score (11) . This grading scale I n p r e s s was originally validated for use in pulmonary edema assessment in acute respiratory distress syndrome (ARDS) and incorporates the extent and density of alveolar opacities on CXRs. The grading system is relevant to COVID-19 patients as the CXR findings tend to involve multifocal alveolar opacities (3) and many hospitalized COVID-19 patients develop ARDS (12) . In our study, we use a modified RALE (mRALE) score. Each lung is assigned a score for the extent of involvement by consolidation or ground glass/hazy opacities (0=none; 1=<25%; 2=25-50%; 3=50-75%; 4=>75% involvement). Each lung score is then multiplied by an overall density score (1=hazy, 2=moderate, 3=dense). The sum of scores from each lung is the mRALE score (examples in Supplemental Figure 1 ). Thus, a normal CXR receives a score of 0, while a CXR with complete consolidation of both lungs receives the maximum score of 24. mRALE differs from the original RALE score in that the lungs are not divided into quadrants. The same raters who assessed the COVID-19 internal test set also evaluated the 92 internal test set patients with follow-up CXRs. For each longitudinal image pair, the raters independently assigned the label: decreased, same, or increased pulmonary disease severity (see Supplemental Materials for annotator viewing conditions). The majority change label was assigned with two or more votes for one label. A convolutional Siamese neural network architecture takes two separate images as inputs, which are separately passed through identical subnetworks with shared weights (schematic in Figure 1A , see Supplemental Materials for image pre-processing details) (8, 9) . We built such a network using DenseNet121 (13) as the underlying subnetwork with initial pre-training on ImageNet, as this architecture had empirically performed well for classification tasks in the CheXpert study (10) . The Euclidean distance D w between the subnetwork outputs, G w (X 1 ) and G w (X 2 ), given image input vectors X 1 and X 2 , is calculated from the equation ( ( , ) = We used a two-step training strategy, that involves pre-training with weak labels on the large CheXpert data set using the contrastive loss function (8), followed by transfer learning to the relatively small COVID-19 training set using mean square error loss, using the assigned mRALE scores as disease severity labels. The contrastive loss function teaches the model the difference between abnormal and normal lungs, while the mean square error loss teaches the model a representation of difference in mRALE scores. Details regarding the training strategy are in the Supplemental Materials. The code is available at https://github.com/QTIM-Lab/PXSscore. For comparison, models were also trained using only the first or second training steps. 
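As a worked illustration of the mRALE scoring rule described above, the small helper below computes an mRALE score from per-lung extent and density grades; the function name and argument layout are our own, not taken from the paper's code.

```python
def mrale_score(right_extent, right_density, left_extent, left_density):
    """Modified RALE: per-lung extent of opacity (0-4) multiplied by the
    per-lung overall density grade (1=hazy, 2=moderate, 3=dense), summed
    over both lungs. Maximum score is 4*3 + 4*3 = 24."""
    assert 0 <= right_extent <= 4 and 0 <= left_extent <= 4
    assert 1 <= right_density <= 3 and 1 <= left_density <= 3
    return right_extent * right_density + left_extent * left_density

# Example: right lung ~60% involved by dense consolidation (extent 3, density 3),
# left lung <25% hazy opacity (extent 1, density 1) -> mRALE = 9 + 1 = 10.
print(mrale_score(3, 3, 1, 1))  # 10
```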
After training the Siamese neural network, when two CXR images are passed through the subnetworks, the Euclidean distance calculated from the subnetwork outputs can serve as a continuous measure of difference between the two CXRs with respect to pulmonary parenchymal findings. Thus, to evaluate a single image-of-interest for pulmonary disease severity, the image can be compared to a pool of N images without a lung abnormality (schematic in Figure 1B). We created a pool of normal images using all cases labeled with "No Finding" from the CheXpert validation set (N=12, ages 19-68 years, 7 women; Supplemental Materials). Using the Siamese neural network, the Euclidean distance is calculated between the image-of-interest and each of the N normal images, and the median of these distances is taken. This median Euclidean distance is the Pulmonary X-Ray Severity (PXS) score.

We used an occlusion sensitivity approach (14) to visualize which portions of the input images were important to the Siamese neural network for calculating the PXS score. See the Supplemental Materials for details.

We used Chi-square and Mann-Whitney tests, Pearson correlation (r), Spearman rank correlation (ρ), linear Cohen's kappa (κ), Fisher's exact test for odds ratios, and bootstrap 95% confidence intervals where appropriate (details in Supplemental Materials). The threshold for statistical significance was set a priori at P<0.05.

There was no significant difference in age, sex, or mRALE scores between the training set and internal test set; patients in the external test set were significantly older than those in the training and internal test sets, but there was no significant difference in sex or mRALE scores (Table 1). Of the 468 patients from the combined training and internal test sets, 134 were intubated or died within 3 days of hospital admission. Age and mRALE scores were significantly higher in these patients (Table 2). The correlation between the mRALE scores assigned by the radiologist raters was similar across the COVID-19 datasets (r=0.84-0.88, P<0.001 in all cases; see Supplemental Materials for details).

In the internal test set, the Siamese neural network-based PXS score correlated with the average assigned mRALE score, a measure of radiographic pulmonary disease severity (r=0.86 (95% CI 0.80-0.90), P<0.001) (Figure 2A). In the external test set, the PXS score also correlated with the average assigned mRALE score (r=0.86 (95% CI 0.79-0.90), P<0.001) (Figure 2B). Using an occlusion sensitivity map-based approach, we show that the network focuses its attention on pulmonary opacities (Figure 2C). Pre-training improved model performance (Table 3; Supplemental Materials).

Of the internal test set patients with available longitudinal CXRs, according to the assigned majority vote change labels, 24 (26%), 19 (21%), and 44 (48%) patients showed a decrease, no change, or increase in pulmonary disease severity, respectively. Five patients (5%) did not receive a majority vote (i.e., the three raters each voted differently; examples in Supplemental Figure 2) and were omitted from further analysis, which reflects subjectivity in the interpretation of heterogeneous CXRs. The inter-rater reliability between the three raters for assigning change labels was moderate (linear Cohen's κ=0.58, 0.59, 0.57). The change in PXS score between two longitudinally acquired images correlated with the majority vote change label (ρ=0.74 (95% CI 0.63-0.81), P<0.001) (Figure 3A).
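A minimal sketch of how a PXS score could be computed at inference time, following the description above: the image-of-interest is compared against each image in a pool of normal anchor CXRs and the median Euclidean distance is returned. The function and variable names are illustrative; a trained SiameseNet as in the earlier sketch and a preloaded tensor of N=12 "No Finding" anchor images are assumed.

```python
import torch

@torch.no_grad()
def pxs_score(model, image, normal_pool):
    """image: (1, C, H, W) tensor; normal_pool: (N, C, H, W) tensor of
    'No Finding' anchor CXRs. Returns the median Euclidean distance
    between the image and the anchors, i.e. the PXS score."""
    model.eval()
    distances = []
    for anchor in normal_pool:
        d = model(image, anchor.unsqueeze(0))  # D_W for this image/anchor pair
        distances.append(d.item())
    return float(torch.tensor(distances).median())
```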
For the patients labeled with decreased disease severity, 18 (75%) had a decreased PXS score. For the patients labeled with increased disease severity, 43 (98%) had an increased PXS score. For the patients labeled with no change, the mean PXS score change was 0.1 (standard deviation ± 1.3). Illustrative examples of longitudinal change assessment are shown in Figure 3B. In cases labeled for no change but with an absolute PXS score change >1, variations in inspiratory effort and positioning seem to account for the PXS change (examples shown in Supplemental Figure 3).

The PXS score was significantly higher on admission CXRs of patients with COVID-19 from our training and internal test sets who were intubated or died within 3 days of admission, compared to those who were not intubated (median PXS score 7.9 versus 3.2, P<0.001) (Figure 4A). Importantly, the PXS score algorithm is not trained on outcomes data. Of the 134 patients who were intubated or died within 3 days of admission, 76 were intubated or died on the admission day and 31, 12, and 15 patients on hospital days 1, 2, and 3, respectively. A higher PXS score was associated with a shorter time interval before intubation or death in these patients (ρ=0.25, P=0.004) (Figure 4B). Given these findings, we used the PXS score as a continuous input for prediction of intubation or death within 3 days of hospital admission. For the 437 patients without an endotracheal tube present on the admission CXR, the receiver operating characteristic area under the curve (AUC) was 0.80 (bootstrap 95% CI 0.75-0.85) (Figure 4C). The PXS threshold can be set at different levels to obtain different test characteristics, which can also be expressed as odds ratios (Table 4).

Front-line clinicians estimate the risk for clinical decompensation in patients with COVID-19 using a combination of data, including epidemiologic factors, comorbidities, vital signs, lab values, and clinical intuition (12, 15). The chest radiograph can contribute to this assessment, but manual assessment of severity is subjective and requires expertise. In this study, we designed and trained a Siamese neural network-based algorithm to provide an automated measure of COVID-19 disease severity on chest radiographs in hospitalized patients, the Pulmonary X-ray Severity (PXS) score. The PXS score correlates with a manually annotated measure of radiographic disease severity in internal and external test sets, and the direction of change in PXS score for longitudinally acquired radiographs is concordant with radiologist assessment. For patients with COVID-19 presenting to the hospital with an admission chest radiograph, the PXS score can help predict subsequent intubation or death.

Manual radiographic severity scores have previously been studied for severe acute respiratory infection (17), parainfluenza virus-associated infections (18), and pediatric pneumonia (19). A manual radiographic grading system for COVID-19 lung disease severity has been associated with increased odds of intubation (20). These studies use manually annotated features from chest imaging to predict outcomes, such as mortality, need for intensive care, and other adverse events. However, barriers to adoption of these systems include limited inter-rater reliability and the learning curve for users. In our study, raters assessing longitudinal change showed only moderate inter-rater agreement. Our automated Siamese neural network-based approach addresses these challenges.
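A small sketch of how the outcome analysis described above could be reproduced, assuming arrays of admission PXS scores and binary intubation/death outcomes (the data below are synthetic placeholders, not the study's data, and the threshold is an illustrative operating point rather than the paper's): sklearn gives the ROC AUC from the continuous score, and a Fisher's exact test on a dichotomized score gives the odds ratio at the chosen threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
# Synthetic placeholder data: PXS scores and 3-day intubation/death outcomes.
pxs = np.concatenate([rng.normal(3.2, 2.0, 300), rng.normal(7.9, 2.5, 100)])
outcome = np.concatenate([np.zeros(300, dtype=int), np.ones(100, dtype=int)])

auc = roc_auc_score(outcome, pxs)  # continuous PXS score as the predictor

threshold = 6.0  # illustrative operating point
high = pxs >= threshold
table = [
    [np.sum(high & (outcome == 1)), np.sum(high & (outcome == 0))],
    [np.sum(~high & (outcome == 1)), np.sum(~high & (outcome == 0))],
]
odds_ratio, p_value = fisher_exact(table)
print(f"AUC={auc:.2f}, OR={odds_ratio:.1f}, P={p_value:.3g}")
```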
Deep learning-based algorithms have been applied to CXRs extensively, but primarily for disease detection, such as for pneumonia and tuberculosis (21, 22), as well as for COVID-19 localization on CXR images (23). However, due to the nature of chest radiography, there are limits to the sensitivity and specificity of this modality for COVID-19 detection (3). There is a relative paucity of research using deep learning for disease severity assessment on CXRs. Automated evaluation of pulmonary edema severity on CXRs has been explored using a deep learning model that incorporates ordinal regression of edema severity labels in training (no, mild, moderate, or severe edema) (24). These severity labels were extracted from associated radiology reports, but are inherently noisy given the variability in interpretation of the CXRs (25, 26). This problem of noisy labels extends beyond pulmonary edema to any disease process where there is subjectivity in interpretation. Our Siamese neural network-based approach mitigates the label noise via transfer learning on data labeled with mRALE, a more fine-grained scoring system which showed high agreement between raters in our study. In addition, pretraining of the Siamese neural network on public data with weak labels helped boost performance.

There are limitations to this study. First, patients in this study were from urban areas of the United States, which may limit the external generalizability of this algorithm to other locations. However, given that the model was able to generalize to a second hospital (community hospital vs quaternary care center) with similar performance, the model seems reasonably robust to differences in practice setting.

Supplemental Table 1. Distribution of abnormal lung labels in the CheXpert image dataset.
Supplemental Table 2.

Full size CXR images in JPEG format from CheXpert were all resized to 320 x 320 pixels, which is within the resolution range of optimal performance for CXR binary classification tasks (27). DICOM files from the COVID-19 CXRs were all pre-processed in the same manner as in CheXpert, with image pixel array extraction using pydicom (28), followed by normalization to [0, 255], conversion to 8-bit, correction of photometric inversion, histogram equalization in OpenCV (29), and conversion to a JPEG file. These DICOMs were anonymized at the time of study export from the PACS. In the external test set CXR images, some images included a large black border around the actual radiograph, which was mostly removed using an automatic cropping algorithm in Python (border pixels with a 0 pixel value were removed).

All annotators were instructed on use of mRALE and practiced on ~10 cases before annotating the complete datasets independently. They were instructed that the goal of mRALE is to grade pulmonary opacity, regardless of cause (e.g. fibrosis or pulmonary edema still presents with a lung opacity, and should be graded as such). The term 'moderate' in the overall density score is taken from the original RALE paper; we interpreted it to mean anything in between hazy opacities and dense consolidation. A lung may have different densities in different parts (e.g. ~50% of the left lung shows opacities, but some is 'moderate' density and some is 'dense'); the rater decides on the predominant density to assign the score.
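A minimal sketch of the DICOM-to-JPEG pre-processing pipeline described above, assuming pydicom and OpenCV are installed; the function names, the photometric-inversion check, and the zero-value border crop are our own illustrative choices, not the authors' script.

```python
import cv2
import numpy as np
import pydicom

def preprocess_dicom(dicom_path, jpeg_path):
    """Extract the pixel array, normalize to [0, 255] 8-bit, correct
    photometric inversion, histogram-equalize, and save as JPEG."""
    ds = pydicom.dcmread(dicom_path)
    arr = ds.pixel_array.astype(np.float32)
    arr = (arr - arr.min()) / max(arr.max() - arr.min(), 1e-6) * 255.0
    img = arr.astype(np.uint8)
    if ds.PhotometricInterpretation == "MONOCHROME1":  # inverted grayscale
        img = 255 - img
    img = cv2.equalizeHist(img)
    cv2.imwrite(jpeg_path, img)

def crop_black_border(img):
    """Remove a surrounding zero-valued border, as done for some external
    test set images."""
    mask = img > 0
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return img[r0:r1 + 1, c0:c1 + 1]
```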
Pleural effusions are not included in the scoring system, though concurrent "basal opacities," which may be due to atelectasis, do contribute to the mRALE score.

For the training set, CXR images were viewed by annotators as JPEG images pre-processed from the DICOMs on personal computers (for convenience during the COVID-19 pandemic). For the internal and external test sets, CXR images were viewed by annotators using PACS stations routinely used for clinical work in the hospital, in standard diagnostic conditions, so as to simulate the real-world radiologist work environment. For the longitudinal image pair annotations of change, CXR images were displayed as side-by-side pre-processed JPEG images to allow for convenient comparison.

We used a two-step training strategy that involves pre-training with weak labels on the large CheXpert data set, followed by transfer learning to the relatively small COVID-19 training set, as follows:

Step 1. We pre-train the Siamese neural network on the CheXpert data using the contrastive loss, L = (1 - Y)(1/2)(D_W)^2 + Y(1/2){max(0, m - D_W)}^2 (where Y = binary label indicating a change in class versus no change, D_W = Euclidean distance, and m = margin) (9). The contrastive loss function is minimized when there is a small Euclidean distance for no change and a large Euclidean distance for a change in class. The margin hyperparameter is empirically set to 50, which is the maximum D_W beyond which dissimilar image input pairs do not contribute further to the loss, helping to stabilize training. As the goal of this algorithm is to generate a measure of disease severity, we trained the convolutional Siamese neural network to maximize the Euclidean distance when the input images showed a difference in labels that identify lung parenchymal abnormalities. In the CheXpert data set, there are annotations that represent pulmonary parenchymal findings, including "lung opacity," "lung lesion," "consolidation," "pneumonia," "atelectasis," and "edema." If an image had any one of these labels (marked positive or uncertain), it was assigned an abnormal lung label. If an image did not have any one of those labels, it was assigned a normal lung label. In training the network, paired CXR images were sampled from the training data.

Step 2. After pre-training on CheXpert data using weak labels, we train the Siamese neural network on the 314-image COVID-19 training set using mean square error (MSE) loss. Each image pair fed to the Siamese neural network results in an output of the Euclidean distance between the final fully connected layers. This Euclidean distance is an abstraction of the difference in pulmonary disease severity between the two input CXRs. The "error" of the MSE loss is the difference between the Euclidean distance and the absolute difference in the labeled mRALE scores of the two input images. The input image pairs are randomly sampled during training and validation, with 1600 and 200 image pairs sampled per epoch, respectively. For training, each input image is resized to 336 x 336 pixels, followed by augmentation with random rotations of ±5° and a random crop to 320 x 320 pixels. For validation, each input image is resized to 336 x 336 pixels followed by a center crop to 320 x 320 pixels. This training step was also implemented using the Adam optimizer, with the same hyperparameters as the previous step, and a batch size of 8. Early stopping was set at 7 epochs without improvement in validation loss. The model with the lowest validation loss was saved for testing evaluation.

The rationale for using contrastive loss in pre-training (Step 1), but MSE loss instead in training (Step 2), is related to the available data labels.
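The two training objectives described above can be sketched as follows, a minimal PyTorch illustration under the same assumptions as the earlier architecture sketch; the margin of 50 comes from the text, while the sampling loop and optimizer setup are abbreviated.

```python
import torch

def contrastive_loss(d_w, y, margin=50.0):
    """Step 1 (pre-training): y = 1 when the two CXRs differ in the weak
    abnormal-vs-normal lung label, y = 0 when they share it."""
    return torch.mean((1 - y) * 0.5 * d_w**2 +
                      y * 0.5 * torch.clamp(margin - d_w, min=0)**2)

def severity_mse_loss(d_w, mrale_a, mrale_b):
    """Step 2 (transfer learning): the distance between two CXRs is
    regressed onto the absolute difference of their mRALE scores."""
    target = torch.abs(mrale_a - mrale_b)
    return torch.mean((d_w - target)**2)

# Illustrative use with a SiameseNet as sketched earlier (names assumed):
# d_w = model(x1, x2)
# loss = contrastive_loss(d_w, y) if pretraining else severity_mse_loss(d_w, m1, m2)
```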
These calculations were all performed using the scipy and sklearn Python packages. The threshold for statistical significance was considered a priori to be P<0.05. Data visualizations were produced using the Seaborn Python package.

In the 314-patient COVID-19 training set, the correlation between the mRALE scores assigned by the two raters was good (r=0.87, P<0.001). In the 154-patient COVID-19 internal test set, the rank correlations of the assigned mRALE scores between the three raters were similar (r=0.85, 0.87, 0.85; P<0.001 in all cases). In the 113-patient COVID-19 external test set, the rank correlations of the assigned mRALE scores between the three raters were also similar (r=0.84, 0.86, 0.88; P<0.001 in all cases).

To evaluate the impact of pre-training, we also trained a Siamese neural network model without CheXpert pre-training. We found that CheXpert pre-training resulted in improved model performance, as demonstrated by increased Pearson and Spearman correlations on the internal test set and an increased Spearman correlation on the external test set (Table 3); the Pearson correlation on the external test set was essentially the same. A model trained using only abnormal versus normal lung labels derived from the CheXpert data set (weak supervision) had worse performance (Table 3).

We empirically set the size of the pool of normal studies for comparison to N = 12; these images were used as comparisons to calculate the PXS score as described in the Methods. During model development, we found that increasing N improves model performance, particularly for smaller Euclidean distances (i.e., lower PXS scores), though with diminishing improvement at larger N (e.g., N = 30 resulted in the same performance as N = 12 in the internal test set). However, model inference time increases with N, due to the increased number of comparisons made using the Siamese neural network.

Any image with a CheXpert annotation (marked positive or uncertain) representing pulmonary parenchymal findings ("lung opacity," "lung lesion," "consolidation," "pneumonia," "atelectasis," or "edema") was assigned an abnormal lung label. If an image did not have any one of those labels, it was assigned a normal lung label. AP, anterior-posterior view.
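As an illustration of the bootstrap confidence intervals mentioned above, the following sketch computes a percentile bootstrap 95% CI for a Pearson correlation between predicted PXS scores and reference mRALE scores; the array names, the number of resamples, and the case-resampling scheme are illustrative assumptions, as the excerpt does not specify the exact bootstrap procedure.

```python
import numpy as np
from scipy.stats import pearsonr

def bootstrap_pearson_ci(scores, references, n_boot=10000, seed=0):
    """Percentile bootstrap 95% CI for Pearson r (resampling scheme assumed)."""
    rng = np.random.default_rng(seed)
    scores, references = np.asarray(scores), np.asarray(references)
    n = len(scores)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        rs.append(pearsonr(scores[idx], references[idx])[0])
    return np.percentile(rs, [2.5, 97.5])
```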
1. ACR Recommendations for the use of Chest Radiography and Computed Tomography (CT) for Suspected COVID-19 Infection | American College of Radiology
2. The Role of Chest Imaging in Patient Management during the COVID-19 Pandemic: A Multinational Consensus Statement from the Fleischner Society
3. Frequency and Distribution of Chest Radiographic Findings in COVID-19 Positive Patients
4. Interobserver reliability of the chest radiograph in community-acquired pneumonia
5. Interobserver Reliability of Radiologists' Interpretations of Mobile Chest Radiographs for Nursing Home-Acquired Pneumonia
6. Variability in the interpretation of chest radiographs for the diagnosis of pneumonia in children
7. Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging
8. Signature Verification Using a "Siamese" Time Delay Neural Network
9. Dimensionality Reduction by Learning an Invariant Mapping
10. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
11. Severity scoring of lung oedema on the chest radiograph is associated with clinical outcomes in ARDS
12. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
13. Densely Connected Convolutional Networks
14. Visualizing and Understanding Convolutional Networks
15. Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations
16. Prediction models for diagnosis and prognosis of covid-19 infection: Systematic review and critical appraisal
17. A chest radiograph scoring system in patients with severe acute respiratory infection: A validation study
18. Progression of the Radiologic Severity Index predicts mortality in patients with parainfluenza virus-associated lower respiratory infections
19. Admission chest radiographs predict illness severity for children hospitalized with pneumonia
20. Clinical and Chest Radiography Features Determine Patient Outcomes In Young and Middle Age Adults with COVID-19
21. Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. 2017
22. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks
23. Deep Learning Localization of Pneumonia
24. Semi-supervised Learning for Quantification of Pulmonary Edema in Chest X-Ray Images
25. Ability of Physicians to Diagnose Congestive Heart Failure Based on Chest X-Ray

Acknowledgments: The authors thank Jeremy Irvin for sharing the CheXpert pre-processing script.

Supplemental Figure 2. Illustrative examples of longitudinal CXR pairs where there was no majority vote for a change label by radiologist raters (i.e. one vote for no change, one vote for worse disease, and one for better disease). In A, the PXS score showed a change of -2.8 (10.4 to 7.6), suggesting decreased lung disease severity. In B, the PXS score showed a change of +1.2 (3.9 to 5.1), suggesting slightly increased lung disease severity.

Supplemental Figure 3. Illustrative examples of the potential impact of differences in inspiratory effort and positioning on PXS score. In A and B, the paired radiographs are from the same patient acquired at different time points (from the longitudinal analysis). In both cases, the CXR from the second time point has a higher PXS score, but this appears to be at least in part due to lower lung volumes with mild atelectasis. In case A, the patient positioning is also different. The majority vote of radiologist annotators in both of these paired cases was for no change in lung disease severity between the CXRs.