key: cord-0608710-5lknfm7o
authors: Cohen, Joseph Paul; Brooks, Rupert; En, Sovann; Zucker, Evan; Pareek, Anuj; Lungren, Matthew P.; Chaudhari, Akshay
title: Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Progressive Exaggeration on Chest X-rays
date: 2021-02-18
journal: nan
DOI: nan
sha: d1c08dc301fb2d983efbe86a995c3e2b8aacbd2b
doc_id: 608710
cord_uid: 5lknfm7o

Motivation: Traditional image attribution methods struggle to satisfactorily explain predictions of neural networks. Prediction explanation is important, especially in medical imaging, for avoiding the unintended consequences of deploying AI systems when false positive predictions can impact patient care. Thus, there is a pressing need to develop improved methods for model explainability and introspection. Specific Problem: A new approach is to transform input images to increase or decrease features which cause the prediction. However, current approaches are difficult to implement as they are monolithic or rely on GANs. These hurdles prevent wide adoption. Our approach: Given an arbitrary classifier, we propose a simple autoencoder and gradient update (Latent Shift) that can transform the latent representation of an input image to exaggerate or curtail the features used for prediction. We use this method to study chest X-ray classifiers and evaluate their performance. We conduct a reader study with two radiologists assessing 240 chest X-ray predictions to identify which ones are false positives (half are) using traditional attribution maps or our proposed method. Results: We found low overlap with ground truth pathology masks for models with reasonably high accuracy. However, the results from our reader study indicate that these models are generally looking at the correct features. We also found that the Latent Shift explanation allows a user to have more confidence in true positive predictions compared to traditional approaches (0.15±0.95 on a 5-point scale with p=0.01), with only a small increase for false positive predictions (0.04±1.06 with p=0.57).
Accompanying webpage: https://mlmed.org/gifsplanation
Source code: https://github.com/mlmed/gifsplanation

It is important to understand why a neural network model is making a prediction, both to ensure that it is using features that we would expect and to discover what unknown features the model is using. Typically, 2D attribution maps are used, which are based on a first-order approximation of the neural network (Simonyan et al., 2014), but these have limitations: they may just represent edges (Adebayo et al., 2018) or simply not indicate the features that are really being used (Viviano et al., 2020; Arun et al., 2020a,b). Recently, the idea of visualizing predictions by exaggerating the features that change a model's predictions has been discussed by Singla et al. (2020, 2021). This exaggeration is the result of a neural network's ability to hallucinate features (Cohen et al., 2018), which is known to be controllable (Mirza and Osindero, 2014). Instead of simply generating images of a specific class, these exaggeration methods can explain the specific features used by a classifier to make each prediction. This is valuable in detecting when a model predicts using spurious correlates, to ensure it is right for the right reasons (Ross et al., 2017; Zech et al., 2018).
While most image pathology prediction models have expected causal relationships, where specific image regions explicitly lead to the classification label, models predicting future risk do not have such a causal relationship. In these scenarios it is valuable to learn which features are being used by a well-performing model when the correct features are not known, such as when predicting "5 year mortality". However, there are two major downsides to existing approaches to this task. 1) They are based on GANs (Goodfellow et al., 2014), which can be very difficult and time consuming to train because of loss instability and hyperparameter sensitivity. 2) They are monolithic models that require the generative and discriminative components to be trained together, which prevents working with existing pretrained models. One would prefer an approach which is modular, as simple as possible to implement, and able to work with any existing classifier as a drop-in replacement for gradient-based attribution maps.

Our approach requires a latent variable model, such as a simple autoencoder D(E(x)), where E is the encoder and D is the decoder, and a classifier f which predicts a target y as follows: y = f(x). The latent variable model and the classifier are trained independently, without any special considerations except for being differentiable. We specifically use an autoencoder because it is simple to implement and train, and we believe this will increase adoption of this method. Once these models are trained, an explanation can be computed as follows. An input image x is encoded using E(x), producing a latent representation z. A perturbation of the latent space is computed from the gradient of the classifier's prediction with respect to the latent representation (Eq 1), and the shifted latent code is decoded to produce λ-shifted samples x_λ (Eq 2). The image x_λ is now expected to produce a higher prediction, such that f(x_λ) > f(x). From here we can generate multiple x_λ images to exaggerate or remove the features which result in a prediction (explored in §4.2). These images can be stitched together into short videos (gifs) that help to explain why a prediction was made and what representation the classifier had about the concept. Examples are available online.¹ An overview of this method is shown in Figure 1.

With this approach it is important to keep in mind that the method is limited by the latent representation of the autoencoder. If the decoder is not expressive enough, then it will not be able to correctly represent the features used by the classifier. Fortunately, this approach allows multiple classifiers to be compared with a fixed autoencoder (or whichever latent variable model is chosen) and gives a clear understanding of the differences in representation between the models.

In essence we want the exact opposite of an adversarial attack. If we were just modifying the image using the gradient ∂f(x)/∂x, as in a traditional adversarial attack, the modification would be imperceptible and would distort the image by selecting spurious pixels which happen to have an impact on the target variable. Our approach regularizes this process using a fixed decoder to keep the image on the data manifold and prevent these spurious pixels from changing. Overall, we seek to modify only the most semantically meaningful pixels that lead to a particular classification output.
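To make the λ-shift computation described above concrete, the following is a minimal PyTorch-style sketch. It assumes an autoencoder object exposing encode and decode methods and a differentiable classifier; the names latent_shift_images, ae, and clf are illustrative and are not taken from the released source code.

    import torch

    def latent_shift_images(ae, clf, x, lambdas):
        # Encode the input and track gradients with respect to the latent code.
        z = ae.encode(x).detach().requires_grad_(True)
        # Classifier prediction on the reconstruction D(z).
        y = clf(ae.decode(z))
        # Gradient of the prediction with respect to the latent representation.
        grad = torch.autograd.grad(y.sum(), z)[0]
        with torch.no_grad():
            # Decode a lambda-shifted latent code for each lambda value.
            return [ae.decode(z + lam * grad) for lam in lambdas]

Because the gradient is taken in the latent space and the shifted codes are passed back through a fixed decoder, the resulting images stay close to the data manifold, which is what distinguishes this from a pixel-space adversarial perturbation.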
The contributions of our work are as follows:
1. We propose a simple and elegant approach to build an exaggeration model, as well as a way to calculate a replacement for a traditional 2D attribution map.
2. We explore the attribution of chest X-ray predictions using this method compared to traditional methods, in terms of IoU overlap with expert masks and a cascading randomization analysis.
3. We study how this method impacts a radiologist's ability to interpret the prediction of a model, compared to traditional attribution methods, when presented with false positive predictions.

The idea of decoupling models has been raised before, and these approaches are similar in spirit to ours in how they walk around the latent space, although they have different formulations and utilize GANs. Schutte et al. (2020) learned a small function to map the latent variable to a predicted target and use it to transform the latent variable. Joshi et al. (2018) move through the latent space based on the classifier's loss function in order to change the class of the image, recursively modifying the latent variable until the class changes.

Three DenseNet121-based classifiers from existing publications were used. There is no requirement for this specific architecture, but there are not many publicly available chest X-ray models. Two models are from Cohen et al. (2020a), referred to as XRV-all and XRV-mimic_ch. The XRV-all model is jointly trained on 7 CXR datasets (NIH, PC, CheX, MIMIC-CXR, Google, OpenI, RSNA, which are described in Appendix §A). The XRV-mimic_ch model is trained only on MIMIC-CXR (Johnson et al., 2019). The third model is from the JF Healthcare group (Ye et al., 2020); it was built for the CheXpert challenge (Irvin et al., 2019) and at one point was ranked 1st on the leaderboard.

There are a few ways to generate a 2D Latent Shift attribution map that is comparable to a typical attribution map. Here we discuss the latentshift-max method, which was found to work best. This method takes a sequence of x_λ images over a specific λ range (discussed in §4.2). First, the absolute difference between the non-shifted reconstruction (λ = 0) and each of the shifted x_λ images is computed. Then the maximum difference at a per-pixel level is taken to produce the final attribution map. Intuitively, this captures the maximum change resulting from the shift. More options for this conversion are discussed in Appendix §B.

The baseline method of input gradients (referred to as grad) computes the absolute gradient of the prediction for the positive class with respect to the input, |∂ŷ₁/∂x| (Simonyan et al., 2014). The method Guided Backprop (Springenberg et al., 2015) (referred to as guided) tries to ignore gradients that cancel each other out by only backpropagating positive gradients. The method Integrated Gradients (Sundararajan et al., 2017) (referred to as integrated) works by integrating gradients between the input image x_i and an all-zero baseline image.

Expert mask annotations were used to evaluate attribution maps. Bounding boxes from the NIH dataset (Wang et al., 2017) were used for Atelectasis, Cardiomegaly, Effusion, and Mass. Segmentation masks from the RSNA Pneumonia Challenge (Shih et al., 2019) were used for Lung Opacity. Segmentation masks from the SIIM-ACR Pneumothorax Challenge (Filice et al., 2020) were used for Pneumothorax. Additional details are in Appendix §A.3.
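As an illustration of the latentshift-max conversion described above, here is a minimal sketch that assumes the shifted images and the λ = 0 reconstruction are PyTorch tensors of the same shape; the name latentshift_max_map is ours, not from the released code.

    import torch

    def latentshift_max_map(x_lambdas, x_zero):
        # Absolute difference of each shifted image from the non-shifted reconstruction.
        diffs = torch.stack([(x_lam - x_zero).abs() for x_lam in x_lambdas])
        # Maximum change at each pixel across the whole lambda sequence.
        return diffs.max(dim=0).values

The other conversions listed in Appendix §B (mean, minmax, sliding interval) differ only in how this stack of differences is reduced to a single 2D map.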
To fairly compute an IoU value (intersection over union; IoU(mask, img) = |mask ∩ img| / |mask ∪ img|) for the 2D attribution methods, we followed Viviano et al. (2020): a binarized attribution map is created in which the top-scoring pixels are set to 1, where the number of selected pixels is dynamically set to the number of pixels in the ground truth mask it is being compared to.

All source code² and datasets (see §A) are publicly available. The classifiers, the autoencoder, and their respective pre-trained weights as used in this work are available in TorchXRayVision 0.0.24 (Cohen et al., 2020b). PyTorch 1.6.0 (Paszke et al., 2017) and Captum 0.3.0 (Kokhlikyan et al., 2020) were used for model training and feature attribution, respectively.

In keeping with our goal of building the most straightforward model, a ResNet (He et al., 2016) convolutional autoencoder was used, as it is able to achieve high-fidelity image reconstruction and is relatively easy to implement. An elastic loss (squared + absolute error) was used to capture both large and small features. This model was trained on 4 large datasets: NIH, PC, RSNA, and MIMIC.

The bottleneck of the autoencoder is a major variable in the quality of the explanations. In Figure 2 the bottleneck size is varied and latentshift-max images are computed using the XRV-all model to predict Cardiomegaly (an enlarged heart). Looking qualitatively at the generated image explanations and their corresponding videos, we observe that a large bottleneck results in spotty changes in the region of interest that do not appear to clearly vary the pathology. At smaller bottleneck sizes, the size of the heart appears to be controlled. However, if the bottleneck is too small, then small features, such as the ribs, are lost. In further experiments a ResNet101 with a bottleneck size of 4608 is used.

Unexpectedly, we find that larger bottleneck sizes have a higher IoU but do not result in a better explanation when viewed qualitatively. The shifted images do not appear to have a smooth transition between each other, and the changes appear unrelated to the pathology. This calls into question how well the IoU analysis captures the quality of these approaches. During training we find that as the validation MAE decreases later in training, the IoU also goes down. This indicates that minimizing the reconstruction error is helpful only early in training; likely towards the end of training, minimizing the small details hurts the ability to control major features of the images. See Appendix §D for more plots.

When making changes to the latent representation it is important to control the extent of the change. Too little, and the difference between the images will not be significant enough to change the prediction of the model. Too large, and the image will become too distorted and will no longer represent the pathology. In Figure 3 the latent representation is varied by different λ values for three different models on four different tasks. Here the direction of the change in the latent space is defined by the gradient computed for each model.

Figure 4: The XRV-all and jfhealthcare models make positive predictions on images for Cardiomegaly. These predictions are explained using multiple 2D attribution maps. An expert bounding box is shown for Cardiomegaly in yellow. No Gaussian blur is applied to these attribution maps.

We observe variation in how the prediction changes for each model. Smoothness here is a sign that the representation is good.
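Returning to the IoU protocol described at the start of this section, the following is a minimal NumPy sketch of the binarization and overlap computation; the names are ours, and the attribution map and mask are assumed to be 2D arrays of the same shape with a non-empty mask.

    import numpy as np

    def iou_top_k(attribution, mask):
        # Keep the k highest-scoring attribution pixels, where k is the
        # number of pixels in the ground truth mask (ties may add a few more).
        k = int(mask.sum())
        threshold = np.sort(attribution.ravel())[-k]
        binarized = attribution >= threshold
        # Intersection over union between the binarized map and the expert mask.
        intersection = np.logical_and(binarized, mask.astype(bool)).sum()
        union = np.logical_or(binarized, mask.astype(bool)).sum()
        return intersection / union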
Surprisingly, the dynamic range of the predictions between these tasks is similar. We observed that this range is decoder specific, and different decoders will have much larger or smaller dynamic ranges. When creating sequences of images we use a simple iterative search to determine the lower and upper λ values (see Appendix §C). The λ values are chosen such that the prediction decreases by 50% and increases by 5%. We find the pathologies appear most clearly when the image sequence removes the pathology, in contrast to prior work which exaggerates it.

In Figure 4 qualitative results are shown when varying the model and pathology across multiple attribution methods. One very notable difference is that this method produces a smoother attribution map without blurring. The gradient-based approaches have a speckled pattern which is typically alleviated using Gaussian blur. Between the two models evaluated we can see that similar regions are highlighted, but they also have distinct differences. This variability is a powerful aspect of this method because we can study the different features used between models. Here it appears that the JF Healthcare model mostly looks at the right side (chest right = image left) of the heart while the XRV-all model looks at both sides. This is also confirmed by looking at the generated videos. 2D images present only a small amount of the information that this method provides. Videos and images can be seen side by side at this URL.³

The different 2D attribution maps are compared based on their IoU in Table 1. This experiment confirms that this method produces attributions similar to those of other methods. While two models achieve reasonable AUC scores for Pneumothorax, their IoU scores are extremely low, which indicates that either the pathology is predicted using spurious features, the bounding boxes are wrong, or the model is predicting using some confounding pathology. The overall low IoU scores alongside high AUCs call into question the validity of using bounding box or mask information to evaluate attribution methods.

Adebayo et al. (2018) showed that even visually convincing attribution maps could be misleading and only weakly dependent on the network parameters. We replicate their proposed cascading randomization evaluation. Starting at the classifier end of the network, layer weights are randomized, the attribution is re-evaluated, and the correlation is computed between the resulting attribution and the original. Intuitively, one expects that the attribution should rapidly become decorrelated. As shown in Figure 5, the correlation with the final attribution drops off most rapidly with latentshift-max. Similarly to the findings of Adebayo et al., the guided backprop method produces a very similar attribution even as a significant fraction of the model is reinitialized. The patterns for other pathologies were extremely similar and are shown along with some further details in Appendix G.
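A minimal PyTorch sketch of the cascading randomization loop is shown below; it assumes an attribution_fn(model, x) callable that returns an attribution map (latentshift-max or a Captum method), and the names are illustrative rather than taken from the released code. The Spearman rank correlation between each returned map and the original attribution is then computed as described in Appendix F.

    import copy

    def cascading_randomization(model, x, attribution_fn):
        # Work on a copy so the original classifier is left untouched.
        model = copy.deepcopy(model)
        # Modules with learnable parameters, in definition order (roughly input to output).
        layers = [m for m in model.modules() if hasattr(m, "reset_parameters")]
        maps = []
        # Reinitialize layers starting from the classifier end of the network,
        # recomputing the attribution after each randomization step.
        for layer in reversed(layers):
            layer.reset_parameters()
            maps.append(attribution_fn(model, x))
        return maps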
We performed a reader study to determine whether our method can improve the ability to detect a false positive prediction by a model, as well as whether the features that are changed are the correct ones. For this study we recruited two radiologists (A.J. and E.Z., with 2 years and 12 years of experience, respectively). They were presented with 240 sample chest X-ray images, each having one of 6 pathologies predicted by the XRV-all model (Atelectasis, Cardiomegaly, Effusion, Lung Opacity, Mass, Pneumothorax). Examples were selected such that 50% were predicted incorrectly by the model (false positives). An incorrect prediction is defined as having a negative label and a >50% prediction by the model, which was calibrated such that a 50% prediction is the operating point of the AUC curve on validation data. These samples were divided equally into two groups, A and B, where group A presents traditional attribution methods (Input gradients, Guided Backprop, and Integrated Gradients) and group B presents the Latent Shift method as an image and a video clip. Radiologists were asked the following questions on a 5-point Likert scale: "How confident are you in the model's prediction? (1-5)" and "Is the model looking at the correct feature? (1-5)".

The primary study results are shown in Figure 6 and more details can be found in Appendix §E. Overall, for true positive predictions there is a 0.15±0.95 confidence increase using the Latent Shift method (p=0.01 using the Wilcoxon signed-rank test). For false positive predictions there is a 0.04±1.06 increase, which is not significant (p=0.57). We expected false positives to be scored lower, so these results raise concerns about overconfidence based on model predictions, although there is the possibility that some of the ground truth labels were wrong.
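As a sketch of how such a paired comparison can be computed, the Wilcoxon signed-rank test can be run on per-case confidence differences with SciPy; the scores below are hypothetical placeholders, not the study data.

    from scipy.stats import wilcoxon

    # Hypothetical paired confidence ratings (1-5 Likert) for the same cases,
    # once with traditional attribution maps and once with Latent Shift.
    traditional = [3, 4, 2, 5, 3, 4, 3, 2]
    latent_shift = [4, 3, 3, 5, 4, 5, 2, 3]

    # Paired, non-parametric test of whether the median difference is zero.
    statistic, p_value = wilcoxon(latent_shift, traditional)
    print(statistic, p_value)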
In the radiologists' feedback (shown in Appendix §E.1) they reported that the Latent Shift method was more intuitive, and they felt it increased their confidence that the model is looking at the correct feature. They observed that this method looks at the boundaries of the abnormality. One radiologist believed that the model was using the chest tube to predict Pneumothorax instead of looking at the correct area. This observation is consistent with the IoU analysis and is likely because the model input is too low resolution (224x224) to see the small features at the edge of the lung.

We presented Latent Shift, a simple-to-implement approach to explain the predictions of models by simulating changes to the input images which increase and decrease the prediction of a classifier. Our approach is designed to be easy to implement, in order to increase adoption in other domains and to work with existing pre-trained classifiers. We evaluated Latent Shift and other attribution methods in how well they aligned with ground truth spatial mask information. We found very low IoU values for models with reasonably high AUCs, but with this alone we cannot conclude which one is in error. The results from our reader study indicate that higher IoU values are correlated with correct features. We find that the Latent Shift explanation allows a user to have more confidence in true positive predictions compared to traditional approaches. However, we also found that detecting false positive predictions was challenging, which highlights the need for a stronger radiologist-algorithm symbiosis.

Appendix D figure caption: The model is evaluated on a fixed set of 10 images which contain Cardiomegaly as indicated by NIH bounding boxes. The epoch of training is shown as the color to more fairly compare these networks, which converge at different rates. We can see that a larger bottleneck produces a smaller MAE but no strong trend for IoU. In panel B, ResNets of different depths are evaluated and no major trend is found, except that potentially a ResNet151 can achieve better IoUs than a ResNet101; however, the computational cost is significantly higher and makes this model harder to train.

Appendix E.1. Radiologist feedback

Reader 1: "Some general observations would be that the new prediction method is more intuitive and for most pathologies increases the confidence that the model is looking at the feature a radiologist would look at to make the diagnosis (except for pneumothorax). There were some clear examples where the model made the correct prediction but missed salient findings (e.g., cases 199, 200: predicted mass but did not detect some large masses). It is also interesting that the model in many cases seems to look at the boundaries of an abnormality rather than the actual abnormality, or everything else except the abnormality (e.g., contralateral lung), in making predictions, so it may have a different "interpretation" style."

Reader 2:
• Latent Shift (B) does much better than the gradients (A) approach.
• Within the gradient methods: Image Gradient and Guided Backprop do well, while the highlighted pixels for Integrated Gradients seem to be all over the place (i.e., not good).
• There is a clear correlation between high output prediction probability and better highlighting of important pixels.
• The model is really struggling with pneumothorax, both in terms of prediction and in terms of highlighting correct pixels. This goes for both method A and B. FYI, I did not "count" a resolved pneumothorax as a "positive pneumothorax". I am sure the model sometimes predicts pneumothorax just because there is a chest tube.

References
• Sanity Checks for Saliency Maps
• Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging (medRxiv)
• Assessing the validity of saliency maps for abnormality localization in medical imaging
• PadChest: A large chest x-ray image dataset with multi-label annotated reports
• Distribution Matching Losses Can Hallucinate Features in Medical Image Translation
• On the limits of cross-domain generalization in automated X-ray prediction
• TorchXRayVision: A library of chest X-ray datasets and models
• Preparing a collection of radiology examinations for distribution and retrieval
• Crowdsourcing pneumothorax annotations using machine learning annotations on the NIH chest X-ray dataset
• Generative Adversarial Networks
• Deep Residual Learning for Image Recognition
• CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
• MIMIC-CXR: A large publicly available database of labeled chest radiographs
• xGEMs: Generating Examplars to Explain Black-Box Models
• Captum: A unified and generic model interpretability library for PyTorch
• Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation
• Conditional Generative Adversarial Nets
• Automatic differentiation in PyTorch
• Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations
• Using StyleGAN for Visual Interpretability of Deep Learning Models on Medical Images
• Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia
• Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
• Explanation by Progressive Exaggeration
• Explaining the Black-box Smoothly: A Counterfactual Approach
• Striving for Simplicity: The All Convolutional Net
• Axiomatic Attribution for Deep Networks
• Saliency is a Possible Red Herring When Diagnosing Poor Generalization
• Adversarial Defense by Latent Style Transformations
• ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases
• Weakly Supervised Lesion Localization With Probabilistic-CAM Pooling
• Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
We thank Joseph D. Viviano, Chin-Wei Huang, Lan Dao, Jin Long, Pranav Rajpurkar, William J Sehnert, and Levon Vogelsang for useful discussions. This research is based on work partially supported by Carestream Health and the CIFAR AI and COVID-19 Catalyst Grants. Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. We thank AcademicTorrents.com for making data available for our research.

Appendix A. Datasets
Autoencoder datasets: NIH, PC, RSNA, and MIMIC
Classifier datasets:
• XRV-all: NIH, PC, CheX, MIMIC-CXR, Google, OpenI, RSNA
• XRV-mimic_ch: MIMIC-CXR using the CheXpert labeller
• jfhealthcare: CheXpert

Appendix B. 3D to 2D Construction
Figure B.1: Examples of the different methods to convert the sequence of images into a 2D attribution map. It is hard to find any differences even though they are generated in unique ways.
• latentshift-mean: Take the average of all x_λ images.
• latentshift-max: Take the maximum distance at each spatial location of all x_λ from the image when λ = 0.
• latentshift-minmax: Subtract the lowest x_λ from the highest: |x_λmin − x_λmax|.
• latentshift-sliding interval: Compute the difference between each λ step and then average them together.

Appendix C. Lambda Search

    # Iterative search for the lower lambda bound: step lambda more negative
    # until the prediction stops decreasing, has dropped by more than 0.5 from
    # the initial prediction, or a hard limit on lambda is reached.
    lbound = 0
    initial_pred = classifier(img)
    last_pred = initial_pred
    while True:
        shifted_img = compute_shift(img, lbound)
        cur_pred = classifier(shifted_img)
        if cur_pred > last_pred or initial_pred - 0.5 > cur_pred or lbound <= -1000:
            break
        last_pred = cur_pred
        lbound = lbound - 10

Appendix F. Robustness
This replicates the sanity check of Adebayo et al. (2018). The test computes the Spearman rank correlation between the pixel importances produced by the attribution map as the network is progressively reinitialized. Atelectasis is shown in Figure 5; the patterns for other pathologies are very similar. The value is computed over 40 images from the NIH dataset, and error bars show the standard deviation of the correlation across these images. As the latentshift-max method inherently produces an absolute-value map, absolute values are taken of all attribution maps before running this test.
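As a minimal sketch of the correlation metric described in Appendix F (the function name is ours, and the two attribution maps are assumed to be arrays of the same shape):

    import numpy as np
    from scipy.stats import spearmanr

    def attribution_rank_correlation(original_map, randomized_map):
        # Absolute values are taken since latentshift-max is inherently an
        # absolute-value map; the maps are then flattened and rank-correlated.
        a = np.abs(original_map).ravel()
        b = np.abs(randomized_map).ravel()
        correlation, _ = spearmanr(a, b)
        return correlation

Averaging this value over a set of images after each reinitialization step of the cascading randomization loop reproduces the kind of curves discussed above.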