key: cord-0291536-w1o15hgt authors: Akbar, M. N.; Wang, X.; Erdogmus, D.; Dalal, S. title: Continuous Severity Assessment Of Pulmonary Edema On Chest X-ray Using Siamese Convolutional Networks date: 2022-02-10 journal: nan DOI: 10.1101/2022.02.09.22270763 sha: 033346c31c0189b538b08834da3e252f1ba80cef doc_id: 291536 cord_uid: w1o15hgt

For physicians to make rapid clinical decisions for patients with congestive heart failure, the assessment of pulmonary edema severity in chest radiographs is vital. While deep learning has shown promise in detecting the presence or absence of such edema, or even its discrete grades of severity, prediction of continuous-valued severity remains a challenge. Here, we propose PENet, a deep learning framework to assess the continuous spectrum of pulmonary edema severity from chest X-rays. We present different modes of implementing this network, and demonstrate that our best model outperforms that of earlier work (mean area under the curve of 0.91 versus 0.87, over nine comparisons), while requiring less training data and computation.

Using medical images to assess disease severity and evaluate longitudinal changes is a routine and important task in clinical decision making. For example, in the case of COVID-19 pneumonia, chest X-ray (CXR) scoring systems are used to escalate or de-escalate care, monitor treatment efficacy, and predict subsequent intubation or death [1]. In pulmonary edema, clinical decisions for patients with acute congestive heart failure (CHF) are often based on the grades of pulmonary edema severity, rather than its mere absence or presence [2]. Reliable estimation of pulmonary edema severity is challenging, since it depends on subtle findings, and inter-rater agreement among even experienced radiologists is low [3]. Given the success of deep learning in computer vision, deep neural networks (DNNs) are now regularly utilized in a diverse range of medical imaging applications [4], [5].
Such DNN models have also been applied to CXRs to detect the presence of edema [6], or its discrete grades of severity [7]. These discrete grades of severity do not always reflect the true continuous spectrum of change, and by discretizing, we potentially lose valuable information on continuous severity assessment. Siamese convolutional networks, already well known in the field of facial and handwriting recognition [8], have been shown recently to be effective in detecting continuous pulmonary COVID-19 severity from CXRs [9]. Inspired by this approach, our work presents PENet: a Siamese convolutional neural network to estimate the continuous scale of pulmonary edema severity in patients with CHF. To summarize our contributions, we: 1) explore weakly supervised pretraining with publicly available CXR datasets and an abnormality definition to produce continuous abnormality scores relevant to pulmonary edema without a condition-specific dataset; 2) subsequently train the pretrained model with a publicly available labeled CHF dataset¹ to predict more accurate, continuous edema severity scores; 3) train a model directly on the CHF dataset, without pretraining, and demonstrate that it performs similarly to the fully trained model with pretraining.

*This work was supported by Philips Research.

The remainder of the paper is organized as follows: Section II outlines our model development and training techniques, Section III presents our findings and observations, and Section IV summarizes our discussion.

In this work, severity labels corresponding to the stages of edema for 4,839 individual frontal (either AP or PA) CXR images are extracted from their radiology reports, following [7]. Each CXR corresponds to an individual CHF patient from MIMIC-CXR [10], and is identified in the radiology reports under one of four severity levels: no edema, vascular congestion (mild edema), interstitial (moderate) edema, and alveolar (severe) edema [7].
These 4,839 labeled images are then split into train (3,354), validation (517), and test (968) splits.

As seen in Fig. 1, two different image preparation techniques are adopted before feeding the input CXRs to PENet. In the first, the input CXRs are resized to 336 pixels on the shorter side and then center cropped to 320x320 pixels. In the second, the input CXRs are resized to 512 pixels on the longer side, and then symmetrically zero padded on both ends of the shorter side. During training, the prepared images are augmented by random translation (±5 percent of height and width) and rotation (±5 degrees). Finally, during all stages of training, validation, and testing, the prepared images are also mean normalized.

(a) Input chest X-ray in original resolution. (b) Resized and center cropped to 320x320. (c) Resized and zero padded to 512x512.
Fig. 1: A sample patient radiograph from MIMIC-CXR [10] and two separate image preparation techniques investigated.

¹ https://physionet.org/content/mimic-cxr-pe-severity/1.0.1/

medRxiv preprint doi: https://doi.org/10.1101/2022.02.09.22270763; this version posted February 10, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

A convolutional Siamese neural network is used to assess the severity score of an input CXR $X_1$, given an anchor (no edema) image $X_2$. Both images are passed through identical parallel sub-networks $f_e(\cdot)$ with shared weights. The backbone of $f_e(\cdot)$ is a DenseNet121 [11], without the last softmax layer, pretrained on ImageNet. The output of each sub-network backbone is a 1000-element unnormalized vector, corresponding to the number of classes in ImageNet.
Each of these vectors is then separately connected to a fully connected layer (FCL). A 9-element output vector from each FCL is then passed through a sigmoid layer to constrain each vector element to the interval [0, 1]. Standard mathematical operations of element-wise subtraction, squaring, summation, and square root are subsequently performed to obtain the Euclidean distance $d_e$ between $f_e(X_1)$ and $f_e(X_2)$. This scoring process for a given CXR is repeated for each anchor in a pool of $k$ images, and the median of the scores is recorded as either the predicted abnormality score or the edema severity score, depending on the training strategy. Fig. 2 illustrates the entire process.

For pretraining PENet further with in-domain CXR images, we make use of two large databases: CheXpert [12] and MIMIC-CXR [10]. Using the CheXpert labeler [12] on each radiology report associated with any image from either CheXpert or MIMIC-CXR, annotations of 'positive', 'negative', and 'uncertain' are generated for several pulmonary findings [9]. To create our abnormality definition with regard to the presence or absence of pulmonary edema, two board-certified radiologists were consulted. At least one positive annotation in any of the following conditions represents an abnormality: 'lung opacity', 'lung lesion', 'consolidation', 'pneumonia', 'atelectasis', and 'edema'. If a report is negative for all the above conditions, or contains a 'no finding' as an annotation summary, that image is labeled normal. All other images are treated as uncertain, and discarded from our analysis.

Three separate training strategies are adopted in this work. In the first, PENet is pretrained by weak supervision with CXR images from either CheXpert or MIMIC-CXR. Following the definition of abnormality outlined earlier, the images are first classified as either normal or abnormal.
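The normal/abnormal classification rule described above can be illustrated with a short sketch. This is not the authors' code: the function name `label_study`, the dictionary layout, and the choice to let a positive abnormal finding take precedence over a 'no finding' summary are assumptions made for illustration only.

```python
from typing import Mapping

# Findings that mark a study as abnormal when at least one is positive
# (list taken from the abnormality definition in the text).
ABNORMAL_FINDINGS = (
    "lung opacity", "lung lesion", "consolidation",
    "pneumonia", "atelectasis", "edema",
)


def label_study(annotations: Mapping[str, str]) -> str:
    """Return 'abnormal', 'normal', or 'uncertain' for one radiology report.

    `annotations` maps a finding name to 'positive', 'negative', or
    'uncertain'. Findings that were not mentioned are simply absent
    (hypothetical input format, assumed for this sketch).
    """
    values = [annotations.get(name) for name in ABNORMAL_FINDINGS]
    # Assumption: any positive finding wins over a 'no finding' summary.
    if any(v == "positive" for v in values):
        return "abnormal"
    if all(v == "negative" for v in values):
        return "normal"
    if annotations.get("no finding") == "positive":
        return "normal"
    return "uncertain"  # such images are discarded from the analysis


if __name__ == "__main__":
    print(label_study({"edema": "positive", "pneumonia": "negative"}))
    print(label_study({"no finding": "positive"}))
    print(label_study({"atelectasis": "uncertain"}))
```

Only the 'abnormal' and 'normal' outputs would feed the weakly supervised pair sampling; 'uncertain' studies are dropped.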
Training image pairs are then chosen as both normal, both abnormal, or either of the two orderings of one normal and one abnormal image, with an equal prior probability of selecting any of the four choices. For pretraining with MIMIC-CXR, care is also taken to discard CXRs of any subject who also appears in the CHF dataset. Once this preliminary training completes, PENet is subsequently trained on the smaller CHF training set. In the second strategy, regular PENet, the model is directly trained on CHF data, without pretraining on a larger CXR dataset first. In the third strategy (a variant of the second), equiprobable PENet, an equal prior probability for selecting an edema label from each severity level, in both training and validation, is ensured by undersampling the overrepresented labels and oversampling the underrepresented labels.

Given input images $X_1$ and $X_2$, PENet calculates the Euclidean distance $d_e$ between the two sub-network outputs $f_e(X_1)$ and $f_e(X_2)$ as

$$d_e = \lVert f_e(X_1) - f_e(X_2) \rVert_2,$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean norm.

To train PENet, four loss functions are investigated. First is the contrastive loss, which minimizes $d_e$ between similar images, while maximizing the distance between dissimilar images. In its definition, $Y_a = 0$ if the two images are similar (both normal or both abnormal) and $Y_a = 1$ if they are dissimilar (one normal and one abnormal), and $m$ is the margin of dissimilarity. The weakly supervised pretraining stage uses only the contrastive loss, and $m = 3$ is chosen since it is the largest possible difference between any two severity levels. Second is the mean square error (MSE) loss, defined with respect to $Y_b \in \{0, 1, 2, 3\}$, which indicates the severity labels. Third, a Huber loss is also explored in training, since this loss is more robust to outliers. Finally, a combination of the contrastive and MSE losses is also investigated, with a weighting of $\alpha = 0.5$.
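As a rough illustration of the distance, scoring, and loss terms described above, the sketch below uses plain Python lists rather than PyTorch tensors. It is not the authors' implementation. The factor of 1/2 in the contrastive loss, the Huber threshold `delta = 1.0`, the use of a single target `y_b` per distance, and the convex combination `alpha * contrastive + (1 - alpha) * mse` are all assumptions based on the commonly used forms of these losses; all function names are hypothetical.

```python
import math
from statistics import median
from typing import Iterable, Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """d_e = || u - v ||_2 for two sub-network output vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def severity_score(embedding: Sequence[float],
                   anchor_embeddings: Iterable[Sequence[float]]) -> float:
    """Median distance of one image to a pool of anchor (no edema) images."""
    return median(euclidean_distance(embedding, a) for a in anchor_embeddings)


def contrastive_loss(d_e: float, y_a: int, margin: float = 3.0) -> float:
    """Assumed standard form: pull similar pairs (y_a = 0) together and
    push dissimilar pairs (y_a = 1) at least `margin` apart."""
    similar_term = (1 - y_a) * 0.5 * d_e ** 2
    dissimilar_term = y_a * 0.5 * max(0.0, margin - d_e) ** 2
    return similar_term + dissimilar_term


def mse_loss(d_e: float, y_b: float) -> float:
    """Squared error between the distance and the severity target."""
    return (d_e - y_b) ** 2


def huber_loss(d_e: float, y_b: float, delta: float = 1.0) -> float:
    """Quadratic near zero, linear for large errors (delta is assumed)."""
    err = abs(d_e - y_b)
    return 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)


def combined_loss(d_e: float, y_a: int, y_b: float,
                  alpha: float = 0.5, margin: float = 3.0) -> float:
    """Assumed weighting: alpha * contrastive + (1 - alpha) * MSE."""
    return alpha * contrastive_loss(d_e, y_a, margin) + (1 - alpha) * mse_loss(d_e, y_b)


if __name__ == "__main__":
    # Two 9-element sigmoid outputs at opposite extremes give the largest distance.
    print(euclidean_distance([0.0] * 9, [1.0] * 9))  # 3.0
```

One observation that follows directly from the architecture: because each of the 9 output elements lies in [0, 1], $d_e$ can never exceed $\sqrt{9} = 3$, which coincides with the margin $m = 3$ and with the largest severity label.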
For the evaluation of the trained models, the Pearson correlation coefficient $r$ is calculated as

$$r = \frac{\sum_{i=1}^{m} (d_{e,i} - \bar{d}_e)(Y_{b,i} - \bar{Y}_b)}{\sqrt{\sum_{i=1}^{m} (d_{e,i} - \bar{d}_e)^2}\,\sqrt{\sum_{i=1}^{m} (Y_{b,i} - \bar{Y}_b)^2}},$$

where $m$ is the size of the test set (not to be confused with the margin above), and $\bar{d}_e$ and $\bar{Y}_b$ are the mean scores of the continuous-valued severity predictions and the discrete ground truth labels, respectively. Additionally, by binning the continuous $d_{e,i}$ scores into binary comparison classes by thresholding, receiver operating characteristic (ROC) plots are generated and the areas under the ROC curves (AUCs) are recorded as indicators of performance.

Fig. 2: Block diagram of PENet. The scoring process is repeated for each $X_2$ in a pool of $k$ anchor (no edema) images, and the median is recorded as the abnormality or the edema severity score, whichever is applicable.

For all training and evaluation, seed values were set to 0. During pretraining, 12,800 image pairs were chosen for training and 400 pairs for validation, randomly for each epoch. During subsequent or direct training on the CHF dataset, 7,200 image pairs were chosen for training and 800 pairs for validation, randomly for each epoch. During training and validation, the inputs are processed in mini-batches of size 8. Adam is the chosen optimizer for all models, with a learning rate of 2e-5. Model weights are saved after each epoch in which the validation loss decreases. If the validation loss plateaus or does not improve for more than 10 epochs, early stopping is triggered. For software, Python 3.8 was used, with support from libraries such as PyTorch, scikit-learn, pandas, pickle, seaborn, and matplotlib. For hardware, an Ubuntu server fitted with an Intel processor and three Nvidia Maxwell GPUs is utilized for all experiments.

In preliminary inference, the pretrained PENet models performed better with the 512x512 zero padded preprocessing than with the 320x320 center cropped variant, as seen in Table I. Hence, the 512x512 preprocessing is used in all subsequent analyses.
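The two evaluation metrics described earlier in this section, the Pearson correlation and the AUC of a binary comparison such as "0 vs. 3", can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' evaluation code (which likely relied on scikit-learn): the function names are hypothetical, and the AUC is computed through its rank interpretation, namely the probability that a randomly chosen image from the higher-severity group receives a higher score than one from the lower-severity group.

```python
import math
from typing import Collection, Sequence


def pearson_r(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Pearson correlation between continuous scores and discrete labels."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(labels) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, labels))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    return cov / math.sqrt(var_s * var_l)


def pairwise_auc(scores: Sequence[float], labels: Sequence[int],
                 negatives: Collection[int], positives: Collection[int]) -> float:
    """AUC for one binary comparison, e.g. negatives={0}, positives={3}.

    Images whose label is in neither group are ignored. Ties count as half.
    """
    neg = [s for s, l in zip(scores, labels) if l in negatives]
    pos = [s for s, l in zip(scores, labels) if l in positives]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up scores and labels, purely to exercise the functions.
    scores = [0.2, 0.4, 1.1, 2.5, 2.9]
    labels = [0, 0, 1, 3, 3]
    print(pearson_r(scores, labels))
    print(pairwise_auc(scores, labels, negatives={0}, positives={3}))
```

A grouped comparison such as "0,1,2 vs. 3" corresponds to `negatives={0, 1, 2}` and `positives={3}` in this sketch.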
Similarly, using $k = 16$ anchor images from the respective validation sets proved to be a good balance between performance and complexity for all models, and is likewise chosen for all subsequent analyses.

B. Abnormality vs. Edema Scoring

Fig. 3 illustrates both the abnormality and edema scores for a sample patient CXR. While the fully trained PENet predicts a more accurate edema score, the weakly supervised model pretrained with MIMIC-CXR is able to generate a reasonably close abnormality score: a desirable feature for tasks such as radiology workflow prioritization [14].

(a) Abnormality score = 1.46 (after pretraining with MIMIC-CXR [10]). (b) Edema severity score = 1.34 (after subsequent training with CHF data [13]).
Fig. 3: Output scores from PENet for the two-step training strategy, on a sample patient chest X-ray with ground truth severity of level 1. The fully trained model produces a more accurate prediction, but the weakly supervised pretrained model also estimates a reasonable score.

Boxplots in Fig. 4 outline the best individual performances of the four PENet variants: the two models additionally pretrained on large CXR datasets (MIMIC-CXR and CheXpert), and the two models with direct CHF dataset training. All the plots show a common linear trend of the continuous-valued predicted edema severity with the discrete ground truth edema severity. Interestingly, the PENet models pretrained on the MIMIC-CXR and CheXpert datasets exhibit a performance similar to that of the regular PENet without any CXR pretraining, in terms of correlation. This indicates that the computationally expensive CXR pretraining might not have provided any noticeable benefit in our experimental scenario. This phenomenon may have two possible explanations. First, pretrained on CXRs or not, all models use DenseNet121 as the network backbone, which had been pretrained on ImageNet. Second, PENet works on pairs of images. Consequently, even when the actual training set has size $n$, PENet can take

$$N = \binom{n}{2} = \frac{n(n-1)}{2}$$

possible combinations of pairs as inputs, where $N$ is the size of the synthetic training set. Thus, the maximum number of distinct pairs available to PENet becomes large: about 5.6 million in our case.

Fig. 4: […] [13] performs as well as PENet pretrained with MIMIC-CXR [10] or CheXpert [12]. Equiprobable PENet has a shift in the distribution of its predictions, with a slight overall drop in performance.

The ROC curves of the two best PENet variants can be plotted as seen in Fig. 5 [… 5 ≤ $d_e$ ≤ …] for labels 0, 1, 2, and 3, respectively. As expected, both models performed almost impeccably on the task of distinguishing images spaced farthest apart in severity (e.g. 0 vs. 3), while they struggled the most on the tasks of classifying between adjacent states (e.g. 2 vs. 3). There is no statistically significant difference in performance at the 5% level of significance.

Table II compares the performance of PENet with the results from [7]. The regular PENet outperforms both the earlier ImageNet trained model and the computationally heavy semi-supervised model (pretrained on MIMIC-CXR, at a higher resolution) in seven out of nine AUC comparisons. The equiprobable PENet does better on the 2 vs. 3 and 0,1,2 vs. 3 comparisons, but performs slightly worse overall compared to the regular model. In the equiprobable model, the spread and accuracy of prediction for level 3 edema (fewest training samples, thus oversampled) improve, while those for level 1 (most training samples, thus undersampled) decline.

In this work, we presented PENet, a Siamese convolutional network to assess the severity of pulmonary edema from chest radiographs.
Using an abnormality definition and a general chest X-ray dataset, our weakly supervised model is able to assess continuous-valued abnormality scores for edema, without the need for an edema-specific dataset. When subsequently, or directly, trained on an edema dataset with discrete labels, PENet can predict continuous-valued severity scores with greater accuracy. For severity score prediction, we found that pretraining appears to provide no additional benefit in this particular task, thus potentially saving valuable training samples and the need for an abnormality definition. This is likely a consequence of the large number of synthetic image pairs PENet is able to extract for training from a relatively small dataset. Directly trained regular PENet even outperforms the best performing semi-supervised model from earlier work, using less training data and lower resolution input images. As future work, we would like to perform cross-validation, and visualize the regions of interest for PENet.

(a) Two-step training: pretrained with MIMIC-CXR (r=0.76, p<0.01). (b) Direct training with CHF data (r=0.76, p<0.01).
[…] [13] has similar area under ROC curve (AUC) performance, compared to when additionally pretrained with MIMIC-CXR, with no statistically significant difference at the 5% level of significance.
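The claim about the number of synthetic image pairs can be checked with a few lines of arithmetic. The sketch below counts the unordered pairs available from the 3,354 CHF training images and relates that to the 7,200 pairs sampled per epoch. The function and variable names are hypothetical, and treating pairs as unordered follows the $n(n-1)/2$ expression given in the results.

```python
from math import comb


def count_unordered_pairs(n_images: int) -> int:
    """Number of distinct unordered image pairs, n * (n - 1) / 2."""
    return comb(n_images, 2)


n_train = 3354          # CHF training split size reported in the paper
pairs_per_epoch = 7200  # training pairs drawn per epoch on the CHF dataset

total_pairs = count_unordered_pairs(n_train)
fraction_per_epoch = pairs_per_epoch / total_pairs

print(total_pairs)  # 5622981, i.e. about 5.6 million
print(f"share of all pairs seen in one epoch: {fraction_per_epoch:.4%}")
```

Under these numbers, a single epoch touches only a small fraction of the possible pairs, which is consistent with the view that pair sampling acts as a large implicit augmentation of a small labeled dataset.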
[1] Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
[2] Assessing and grading congestion in acute heart failure: a scientific statement from the Acute Heart Failure Committee of the Heart Failure Association of the European Society of Cardiology and endorsed by the European Society of Intensive Care Medicine
[3] Improving diagnostic accuracy in assessing pulmonary edema on bedside chest radiographs using a standardized scoring approach
[4] Applications of deep learning to MRI images: A survey
[5] Mapping motor cortex stimulation to muscle responses: a deep neural network modeling approach
[6] CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning
[7] Deep learning to quantify pulmonary edema in chest radiographs
[8] Siamese neural networks: An overview
[9] Automated assessment and tracking of COVID-19 pulmonary disease severity on chest radiographs using convolutional Siamese neural networks
[10] MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
[11] Densely connected convolutional networks
[12] CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
[13] Multimodal representation learning via maximization of local mutual information
[14] Smart chest X-ray worklist prioritization using artificial intelligence: a clinical workflow simulation

We would like to thank Dr. Steven Horng and Dr. Seth J. Berkowitz (Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA) for their guidance in preparing the abnormality definition. We would also like to thank Ruizhi Liao, PhD (Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology) for providing us the edema labels.