title: Deep learning saliency maps do not accurately highlight diagnostically relevant regions for medical image interpretation
authors: Saporta, A.; Gui, X.; Agrawal, A.; Pareek, A.; Truong, S. Q.; Nguyen, C. D.; Ngo, V.-D.; Seekins, J.; Blankenberg, F. G.; Ng, A.; Lungren, M. P.; Rajpurkar, P.
date: 2021-03-02
DOI: 10.1101/2021.02.28.21252634

Deep learning has enabled automated medical image interpretation at a level often surpassing that of practicing medical experts. However, many clinical practices have cited a lack of model interpretability as a reason to delay the use of "black-box" deep neural networks in clinical workflows. Saliency maps, which "explain" a model's decision by producing heat maps that highlight the areas of the medical image that influence model prediction, are often presented to clinicians as an aid in diagnostic decision-making. In this work, we demonstrate that the most commonly used saliency map generating method, Grad-CAM, results in low performance for 10 pathologies on chest X-rays. We examined under what clinical conditions saliency maps might be more dangerous to use compared to human experts, and found that Grad-CAM performs worse for pathologies that had multiple instances, were smaller in size, and had shapes that were more complex. Moreover, we showed that model confidence was positively correlated with Grad-CAM localization performance, suggesting that saliency maps were safer for clinicians to use as a decision aid when the model had made a positive prediction with high confidence. Our work demonstrates that several important limitations of interpretability techniques for medical imaging must be addressed before use in clinical workflows.

Deep learning has enabled automated medical imaging interpretation at a level shown to surpass that of practicing experts in some settings 1-3. While the potential benefits of automated diagnostic models are numerous, lack of model interpretability in the use of […] in the context of automated decision-making 7. Although many DNN interpretability techniques have been proposed, rigorous investigation of the accuracy and reliability of these strategies is lacking and necessary before they are integrated into the clinical setting 8.

One type of DNN interpretation strategy widely used in the context of medical imaging is based on saliency (or pixel-attribution) methods 9-12. Saliency methods produce heat maps that highlight the areas of the medical image that most influenced the DNN's prediction. The heat maps help to visualize whether a DNN is concentrating on the same regions of the medical image that a human expert would focus attention on for a given diagnosis, rather than concentrating on a clinically irrelevant part of the medical image or even on confounders in the image 13-15. However, recent work has shown that saliency methods used to validate model predictions can be misleading in some cases and may lead to increased bias and loss of user trust, with concerning implications for clinical translation efforts 16.
The purpose of this work is to perform a systematic evaluation of the most common saliency method, Grad-CAM 17, on multi-label classification models for medical imaging interpretation from chest X-rays. In order to evaluate how well the saliency method identifies critical areas of an image for diagnosis, we compared the saliency method […]

[…] trained only on CXR images and their corresponding pathology task labels. Grad-CAM is used to generate 10 heat maps for the example CXR, one for each task. Middle, there are three pathologies present in this CXR (Airspace Opacity, Pleural Effusion, and Support Devices). Right, a threshold is applied to the heat maps to produce binary segmentations for each present pathology. b, Two board-certified radiologists were asked to segment the pathologies present in the CXR as determined by the dataset's ground-truth labels. Saliency method segmentations are compared to these reference segmentations to evaluate how well Grad-CAM identifies the clinically relevant areas of the input CXR ("AI localization performance"). c, Three radiologists (separate from those in b) were asked to segment the pathologies present in the CXR as determined by the dataset's ground-truth labels. These benchmark segmentations are compared to the reference segmentations to determine a human benchmark ("expert localization performance"). d, The location of the pixel with the largest value was extracted from each heat map. e, In addition to drawing segmentations, the benchmark radiologists were asked to locate each pathology present on each CXR using only a single point on that CXR.

Evaluating the localization performance of the saliency method

We used two evaluation schemes to compare the AI localization performance to the expert localization performance. First, we used mean Intersection over Union (mIoU) to measure how much, on average, either the saliency method segmentations or the human benchmark segmentations overlapped with the reference segmentations. Second, we used the pointing game setup 29, in which a "hit" is when the single point used to locate a pathology lies within the reference segmentation and a "miss" is when the single point lies outside of it.

For nine of the 10 pathologies, the saliency segmentations had a lower overlap with the reference segmentations than did the human benchmark segmentations (see Fig. 2b). Similarly, for nine of the 10 pathologies, the saliency method had a lower hit rate than the human benchmark (see Fig. 2c).
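For concreteness, the following is a minimal Python sketch (not the authors' released code) of the two evaluation steps described above: upsampling a coarse Grad-CAM heat map to the CXR resolution, thresholding it into a binary saliency segmentation, and scoring it against a radiologist reference segmentation with IoU and the pointing game. The resize call, the threshold value, and the array shapes are illustrative assumptions.

```python
# Minimal sketch (assumed pipeline, not the authors' released code): turn a coarse
# Grad-CAM heat map into a binary saliency segmentation and score it against a
# radiologist reference segmentation with IoU and the pointing game.
import cv2
import numpy as np

def upsample_cam(cam, out_shape):
    """Normalize a coarse CAM (e.g. 14x14) to [0, 1] and bilinearly upsample it
    to the CXR resolution (out_shape = (height, width))."""
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cv2.resize(cam, (out_shape[1], out_shape[0]), interpolation=cv2.INTER_LINEAR)

def iou(pred, ref):
    """Intersection over Union of two binary masks (0 when the union is empty)."""
    union = np.logical_or(pred, ref).sum()
    return float(np.logical_and(pred, ref).sum() / union) if union > 0 else 0.0

def pointing_game_hit(cam_full, ref):
    """'Hit' if the most activated pixel lies inside the reference segmentation."""
    y, x = np.unravel_index(np.argmax(cam_full), cam_full.shape)
    return bool(ref[y, x])

# Toy usage: a 14x14 CAM scored against a 320x320 reference mask.
cam = np.random.rand(14, 14).astype(np.float32)
ref = np.zeros((320, 320), dtype=bool)
ref[100:200, 80:220] = True                # placeholder reference segmentation
cam_full = upsample_cam(cam, ref.shape)
saliency_seg = cam_full >= 0.5             # threshold value is an assumption
print(iou(saliency_seg, ref), pointing_game_hit(cam_full, ref))
```

In this framing, the mIoU for a pathology is simply the mean of these per-image IoU values over all CXRs in which the pathology is present.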
Under both the overlap and the pointing game evaluation […]

[…] Left, IoU score is 0.257, and the pointing game is a "hit" since Grad-CAM's most activated pixel is inside the reference segmentation. Right, IoU score is 0.071, and the pointing game is a "miss" since Grad-CAM's most activated pixel is outside the reference segmentation.

In order to better understand under what circumstances the AI localization performance was closer to, or further from, the expert localization performance, we first conducted a […] pathological characteristic on the evaluation metric at hand. See Table 2 for coefficients from the regressions under the overlap and hit rate evaluation schemes.

Our qualitative analysis uncovered three patterns in the saliency method segmentations that were associated with lower localization performance. First, we observed that when multiple instances of a single pathology are present in a CXR, instead of highlighting each distinct instance of the pathology separately, the saliency method segmentation often highlights one large confluent area that encompasses all of the instances (see Fig. 3c). Second, we found that saliency method segmentations tend to be significantly larger than either the human benchmark or reference segmentations, and often fail to respect clear anatomical boundaries (see Fig. 3d). Correspondingly, the AI overlap coefficient for area […]

Since the saliency method is highly dependent on the DNN's architecture, we conducted statistical analyses to determine whether there was any correlation between the model's confidence in its prediction and AI localization performance. We first ran a simple regression for each pathology using the model's probability output for the pathology as the single independent variable and using the IoU of the AI segmentation with the reference segmentation as the response variable. We then performed a simple regression that uses the same approach as above, but that includes all 10 pathologies. For each of the 11 regressions, we excluded true negative cases in order to calculate the IoU score for the expert segmentations. In addition to the linear regression coefficients, we also computed the Spearman correlation coefficients to capture any potential non-linear associations (see Table 3). […] We also performed analogous experiments using hit rate as the response variable and found comparable results (see Supplementary Table 1).
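To make the confidence analysis concrete, the sketch below shows one way such a per-pathology regression and Spearman correlation could be computed with scipy; the array names and the synthetic data are placeholders, not the authors' data or released code.

```python
# Minimal sketch (not the authors' code): relating model confidence to Grad-CAM IoU
# for one pathology. `probs` holds the model's probability outputs and `ious` the IoU
# of each Grad-CAM segmentation with the reference segmentation, restricted to CXRs
# where the pathology is present (true negatives excluded).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
probs = rng.uniform(0.1, 0.9, size=200)                        # placeholder confidences
ious = np.clip(0.5 * probs + rng.normal(0, 0.1, 200), 0, 1)    # placeholder IoU scores

# Simple linear regression: IoU ~ model probability output
lin = stats.linregress(probs, ious)
print(f"slope={lin.slope:.3f}, p-value={lin.pvalue:.3g}")

# Spearman rank correlation to capture potential non-linear (monotonic) associations
rho, p = stats.spearmanr(probs, ious)
print(f"Spearman rho={rho:.3f}, p-value={p:.3g}")
```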
The purpose of this work was to evaluate the performance of saliency methods, which are widely used in clinical practice for DNN prediction explainability. We demonstrate that saliency maps are consistently worse than expert radiologists at localizing a variety of pathologies on CXRs. We use qualitative and quantitative analyses to establish that AI localization performance is furthest from expert localization performance for pathologies that have multiple instances, are smaller in size, and have shapes that are more complex, suggesting that deep learning explainability as a clinical interface may be less reliable and less useful when used for pathologies with those characteristics. We show that model assurance is positively correlated with AI localization performance, which could indicate that saliency methods are safer to use as a decision aid to clinicians when the model has made a positive prediction with high confidence. Finally, since IoU computes the overlap of two segmentations but pointing game hit rate better captures diagnostic attention, we suggest using both metrics to evaluate both AI and expert localization performance.

Our work has several potential implications for patient care. Heat maps generated using saliency methods are advocated as clinical decision support in the hope that the heat maps not only improve clinical decision-making, but also encourage clinicians to trust model predictions 32-34. However, we found that AI localization performance, on balance, was worse than expert localization across multiple analyses. This is consistent with […] not only under what clinical conditions saliency methods might be safer to use, but also how we might improve saliency methods in the future. We found that AI localization performance worsens in the presence of pathologies that have multiple instances. We also found that AI localization performance worsens in the presence of pathologies that are smaller in size compared with the CXR image.
This result explains why, under both the overlap and hit rate schemes, the gap between AI and expert localization performance was the largest for Lung Lesion, whose mean area ratio was the smallest of the 10 pathologies we explored. Moreover, AI localization performance under both evaluation schemes was best for Enlarged Cardiomediastinum and Cardiomegaly, which had the largest and third largest area ratios, respectively, of the 10 pathologies, suggesting that saliency methods might be safer to use in the context of these two pathologies, or pathologies with similar characteristics. Grad-CAM segmentations often fail to respect clear anatomical boundaries, and we hypothesize that this is an algorithmic artifact of Grad-CAM, whose feature-map-sized (14 × 14) heat map is interpolated to the original image dimension (usually about 2000 × 2000 pixels), resulting in coarse resolution. We also found that AI localization performance worsens in the presence of pathologies whose shapes are more complex. AI localization for Pneumothorax and Support Devices, both of which were more elongated and complex than any of the other conditions, underperformed compared to expert localization performance; however, this performance gap must also be considered in the context of the model training data prevalence, and future work may explore the impact of training data prevalence on the localization performance of saliency methods.

While IoU is a commonly used metric for evaluating semantic segmentation outputs, there are inherent limitations to the metric in the pathological context. This is indicated by our finding that even the expert segmentations had relatively low overlap with the reference segmentations (the highest expert mIoU was 0.712, for Cardiomegaly). One potential explanation for this consistent underperformance is that pathologies can be hard to distinguish, especially without clinical context. Furthermore, whereas many people might agree on how to segment, say, a cat or a stop sign in traditional computer vision tasks, radiologists use a certain amount of clinical discretion when defining the boundaries of a pathology on a CXR. There can also be institutional and geographic differences in how radiologists are taught to recognize pathologies, and studies have shown that there can be high interobserver variability in the interpretation of CXRs 36-38. We sought to address this with the hit rate evaluation metric, which highlights when two radiologists share the same diagnostic intention, even if it is less exact than IoU in comparing segmentations directly. Expert performance using hit rate was above 0.95 for four pathologies (Pneumothorax, Cardiomegaly, Support Devices, and Enlarged Cardiomediastinum); these are pathologies for which there is often little disagreement between radiologists about where the pathologies are located, even if the expert segmentations are noisy. The only pathology for which AI localization performance was better than expert localization performance under the hit rate scheme was Consolidation.
However, because the hit rate scheme required the benchmark radiologists to select only one point on the CXR, even if there were multiple instances of the pathology present (as is often the case with Consolidation), it is likely that the hit rate setup unfairly penalized expert performance in this case and that it is not the best evaluation metric to use for Consolidation. Further work is needed to validate this hypothesis and to demonstrate which segmentation evaluation metrics, even beyond overlap and hit rate, are more appropriate for which pathologies when evaluating saliency methods for the clinical setting.

This study does not involve human subject participants.

[…] the ratio between the area of overlap and the area of union between the ground-truth and the predicted areas, ranging from 0 to 1, with 0 signifying no overlap and 1 signifying perfectly overlapping segmentations. We then compared the mean Intersection over Union […]

Pathology Characteristics. We used four features to characterize the pathologies. 1. Number of instances is defined as the number of disjoint components in the segmentation. 2. Area ratio is the area of the pathology divided by the total image area. 3. and 4. Elongation and irrectangularity are geometric features that measure shape complexity; they were designed to quantify what radiologists qualitatively describe as focal or diffuse. To calculate these metrics, a rectangle of minimum area enclosing the contour is fitted to each pathology. Elongation is defined as the ratio of the rectangle's longer side to its shorter side. Irrectangularity is defined as 1 − (area of the segmentation / area of the enclosing rectangle), with values ranging from 0 to 1 and 1 being highly irrectangular. When there are multiple instances within one pathology, we used the characteristics of the dominant instance (largest in perimeter).

Model Confidence. We used the probability output of the DNN architecture for model confidence. The probabilities were normalized using min-max normalization per pathology before aggregation.

Linear Regression. For each evaluation scheme (overlap and hit rate), we ran three groups of simple linear regressions, with the expert and AI evaluation metrics and their differences as the response variables. Each group has four regressions, using each of the above four pathological characteristics as the regression's single attribute, respectively; only CXRs with a positive label were included in each regression (n = 1,534). All features were normalized using min-max normalization so that they are comparable in scale. We report the 95% confidence interval and p-value of the regression coefficients.
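As an illustration of how these four pathology characteristics could be computed from a binary segmentation mask, here is a minimal OpenCV-based sketch; the paper does not specify its implementation, so the use of connected components, cv2.minAreaRect, and perimeter-based selection of the dominant instance are assumptions.

```python
# Minimal sketch (assumed implementation) of the four pathology characteristics
# computed from a binary segmentation mask of one pathology.
import cv2
import numpy as np

def pathology_characteristics(mask: np.ndarray) -> dict:
    """mask: 2D array with 1 inside the pathology segmentation, 0 elsewhere."""
    mask = mask.astype(np.uint8)

    # 1. Number of instances: disjoint connected components in the segmentation.
    n_labels, _ = cv2.connectedComponents(mask)
    n_instances = n_labels - 1  # label 0 is the background

    # 2. Area ratio: pathology area divided by total image area.
    area_ratio = float(mask.sum()) / mask.size

    # 3-4. Elongation and irrectangularity of the dominant instance
    # (largest perimeter), via its minimum-area enclosing rectangle.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    dominant = max(contours, key=lambda c: cv2.arcLength(c, closed=True))
    (_, _), (w, h), _ = cv2.minAreaRect(dominant)
    elongation = max(w, h) / max(min(w, h), 1e-8)            # longer side / shorter side
    irrectangularity = 1.0 - cv2.contourArea(dominant) / max(w * h, 1e-8)

    return {"n_instances": n_instances, "area_ratio": area_ratio,
            "elongation": elongation, "irrectangularity": irrectangularity}
```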
[…] contoured. For Support Devices, radiologists were asked to contour any implanted or invasive devices, including pacemakers, PICC/central catheters, chest tubes, endotracheal tubes, feeding tubes, and stents, and to ignore ECG lead wires or external stickers visible in the chest X-ray. Finally, of the 14 observations labeled in the CheXpert dataset, Fracture, Pleural Other, and Pneumonia were not segmented because they either had low prevalence and/or ill-defined boundaries unfit for segmentation.

[…] each of the 14 observations. In case of availability of more than one view, the models output the maximum probability of the observations across the views. Each chest X-ray was resized to 320 × 320 pixels and normalized before it was fed into the network. The DenseNet121 model architecture 46 was used. Cross-entropy loss was used to train the model. The Adam optimizer 47 was used with default β-parameters of β1 = 0.9 and β2 = 0.999, and the learning rate was fixed at 1 × 10⁻⁴ for the duration of training. Batches were sampled using a fixed batch size of 16 images.
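As a minimal sketch of the training configuration just described (not the authors' released code), the following shows how such a DenseNet121 multi-label model could be set up in PyTorch. The pretrained initialization, the ImageNet normalization statistics, the binary cross-entropy formulation of the per-observation loss, and the data loader are all assumptions or placeholders.

```python
# Minimal sketch (assumed implementation) of the described training setup:
# DenseNet121, 14 outputs, 320x320 inputs, cross-entropy (per-observation BCE) loss,
# Adam with betas=(0.9, 0.999), learning rate 1e-4, batch size 16.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

NUM_OBSERVATIONS = 14

model = torchvision.models.densenet121(pretrained=True)  # ImageNet init assumed
model.classifier = nn.Linear(model.classifier.in_features, NUM_OBSERVATIONS)

criterion = nn.BCEWithLogitsLoss()  # per-observation binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Resize and normalization; ImageNet statistics assumed, with grayscale CXRs
# replicated to three channels upstream.
transform = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def train_one_epoch(train_loader):
    """`train_loader` is a placeholder DataLoader yielding (images, labels) batches of 16."""
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels.float())
        loss.backward()
        optimizer.step()

# At inference, probabilities from multiple views of the same study can be reduced
# with an element-wise maximum per observation:
# study_probs = torch.stack([torch.sigmoid(model(v)) for v in views]).amax(dim=0)
```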
Data Availability. CheXpert data is available at https://stanfordmlgroup.github.io/competitions/chexpert/. The validation set and corresponding benchmark radiologist annotations will be made available online for the purpose of extending the study.

Code Availability. All code used to produce the results of the paper will be in a public repository for the purpose of reproducing the study. The link to the code will be added to the text of the paper for the camera-ready version.

References

Interpretability of Artificial Intelligence in Radiology: Challenges and Opportunities.
Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization.
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034.
Towards Trainable Saliency Maps in Medical Imaging.
Quantifying Explainability of Saliency Methods in Deep Neural Networks.
Deep learning predicts hip fracture using confounding patient and healthcare variables.
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.
AI for radiographic COVID-19 detection selects shortcuts over signal.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
Visual Explanations from Deep Networks via Gradient-based Localization.
Deep learning approaches to biomedical image segmentation.
Performance of a convolutional neural network derived from an ECG database in recognizing myocardial infarction.
Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network.
Referral for disease-related visual impairment using retinal photograph-based deep learning: a proof-of-concept.
Deep Learning for Predicting Refractive Error From […].
Detection of anaemia from retinal fundus images via deep learning.
Deep Learning to Assess Long-term Mortality From Chest Radiographs.
CheXaid: deep learning assistance for physician diagnosis of tuberculosis using chest x-rays in patients with HIV.
Human-computer collaboration for skin cancer recognition.
AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining.
Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet.
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.
Top-down Neural Attention by Excitation Backprop. arXiv:1608.00507.
Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study.
DLBCL-Morph: Morphological features computed using deep learning for an annotated digital DLBCL image set.
Impact of Deep Learning Assistance on the Histopathologic Review of Lymph Nodes for Metastatic Breast Cancer.
Deep Learning for the Digital Pathologic Diagnosis of […].
Evaluating the Impact of a Web-based Diagnostic Assistant.
Deep Learning-Assisted Diagnosis of Cerebral Aneurysms Using the HeadXNet Model.
Network output visualization to uncover limitations of deep learning detection of pneumothorax. Medical Imaging 2020: Image Perception, Observer Performance, and Technology Assessment 11316, 113160O (International Society for Optics and Photonics).
Interobserver Variability in the Radiographic Diagnosis of Adult Outpatient Pneumonia.
Disagreements in Chest Roentgen Interpretation.
Interobserver Reliability of the Chest Radiograph in […].
Assessing the validity of saliency maps for abnormality localization in medical imaging.
Evaluation and Comparison of CNN Visual Explanations for Histopathology.
Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging.
Axiomatic Attribution for Deep Networks.
Weakly Supervised […].
Densely Connected Convolutional Networks. arXiv:1608.06993.
Adam: A Method for Stochastic Optimization.
A Threshold Selection Method from Gray-Level Histograms.
Competing Interests. There are no competing interests.