title: Deep learning saliency maps do not accurately highlight diagnostically relevant regions for medical image interpretation
authors: Saporta, A.; Gui, X.; Agrawal, A.; Pareek, A.; Truong, S. Q.; Nguyen, C. D.; Ngo, V.-D.; Seekins, J.; Blankenberg, F. G.; Ng, A.; Lungren, M. P.; Rajpurkar, P.
date: 2021-03-02
DOI: 10.1101/2021.02.28.21252634

Deep learning has enabled automated medical image interpretation at a level often surpassing that of practicing medical experts. However, many clinical practices have cited a lack of model interpretability as a reason to delay the use of "black-box" deep neural networks in clinical workflows. Saliency maps, which "explain" a model's decision by producing heat maps that highlight the areas of the medical image that influence model prediction, are often presented to clinicians as an aid in diagnostic decision-making. In this work, we demonstrate that the most commonly used saliency map generating method, Grad-CAM, results in low performance for 10 pathologies on chest X-rays. We examined under what clinical conditions saliency maps might be more dangerous to use compared to human experts, and found that Grad-CAM performs worse for pathologies that had multiple instances, were smaller in size, and had shapes that were more complex. Moreover, we showed that model confidence was positively correlated with Grad-CAM localization performance, suggesting that saliency maps were safer for clinicians to use as a decision aid when the model had made a positive prediction with high confidence. Our work demonstrates that several important limitations of interpretability techniques for medical imaging must be addressed before use in clinical workflows.

Deep learning has enabled automated medical imaging interpretation at a level shown to surpass that of practicing experts in some settings 1-3. While the potential benefits of automated diagnostic models are numerous, lack of model interpretability in the use of […] in the context of automated decision-making 7. Although many DNN interpretability techniques have been proposed, rigorous investigation of the accuracy and reliability of these strategies is lacking and necessary before they are integrated into the clinical setting 8.

One type of DNN interpretation strategy widely used in the context of medical imaging is based on saliency (or pixel-attribution) methods 9-12. Saliency methods produce heat maps that highlight the areas of the medical image that most influenced the DNN's prediction. The heat maps help to visualize whether a DNN is concentrating on the same regions of the medical image that a human expert would focus attention on for a given diagnosis, rather than concentrating on a clinically irrelevant part of the medical image or even on confounders in the image 13-15. However, recent work has shown that saliency methods used to validate model predictions can be misleading in some cases and may lead to increased bias and loss of user trust, with concerning implications for clinical translation efforts 16.
The purpose of this work is to perform a systematic evaluation of the most common saliency method, Grad-CAM 17, on multi-label classification models for medical imaging interpretation from chest X-rays. In order to evaluate how well the saliency method identifies critical areas of an image for diagnosis, we compared the saliency method […]

[…] trained only on CXR images and their corresponding pathology task labels. Grad-CAM is used to generate 10 heat maps for the example CXR, one for each task. Middle, there are three pathologies present in this CXR (Airspace Opacity, Pleural Effusion, and Support Devices). Right, a threshold is applied to the heat maps to produce binary segmentations for each present pathology. b, Two board-certified radiologists were asked to segment the pathologies present in the CXR as determined by the dataset's ground-truth labels. Saliency method segmentations are compared to these reference segmentations to evaluate how well Grad-CAM identifies the clinically relevant areas of the input CXR ("AI localization performance"). c, Three radiologists (separate from those in b) were asked to segment the pathologies present in the CXR as determined by the dataset's ground-truth labels. These benchmark segmentations are compared to the reference segmentations to determine a human benchmark ("expert localization performance"). d, The location of the pixel with the largest value was extracted from each heat map. e, In addition to drawing segmentations, the benchmark radiologists were asked to locate each pathology present on each CXR using only a single point on that CXR.

Evaluating the localization performance of the saliency method

We used two evaluation schemes to compare the AI localization performance to the expert localization performance. First, we used mean Intersection over Union (mIoU) to measure how much, on average, either the saliency method segmentations or the human benchmark segmentations overlapped with the reference segmentations. Second, we used the pointing game setup 29, in which a "hit" is when the single point used to locate a pathology lies within the reference segmentation and a "miss" is when the single point lies outside of it.

For nine of the 10 pathologies, the saliency segmentations had a lower overlap with the reference segmentations than did the human benchmark segmentations (see Fig. 2b). Similarly, for nine of the 10 pathologies, the saliency method had a lower hit rate than the human benchmark (see Fig. 2c).
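For concreteness, the following is a minimal Python sketch (not the authors' released code) of the two evaluation steps described above: upsampling a coarse Grad-CAM heat map to the CXR resolution, thresholding it into a binary saliency segmentation, and scoring it against a radiologist reference segmentation with IoU and the pointing game. The resize call, the threshold value, and the array shapes are illustrative assumptions.

```python
# Minimal sketch (assumed pipeline, not the authors' released code): turn a coarse
# Grad-CAM heat map into a binary saliency segmentation and score it against a
# radiologist reference segmentation with IoU and the pointing game.
import cv2
import numpy as np

def upsample_cam(cam, out_shape):
    """Normalize a coarse CAM (e.g. 14x14) to [0, 1] and bilinearly upsample it
    to the CXR resolution (out_shape = (height, width))."""
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cv2.resize(cam, (out_shape[1], out_shape[0]), interpolation=cv2.INTER_LINEAR)

def iou(pred, ref):
    """Intersection over Union of two binary masks (0 when the union is empty)."""
    union = np.logical_or(pred, ref).sum()
    return float(np.logical_and(pred, ref).sum() / union) if union > 0 else 0.0

def pointing_game_hit(cam_full, ref):
    """'Hit' if the most activated pixel lies inside the reference segmentation."""
    y, x = np.unravel_index(np.argmax(cam_full), cam_full.shape)
    return bool(ref[y, x])

# Toy usage: a 14x14 CAM scored against a 320x320 reference mask.
cam = np.random.rand(14, 14).astype(np.float32)
ref = np.zeros((320, 320), dtype=bool)
ref[100:200, 80:220] = True                # placeholder reference segmentation
cam_full = upsample_cam(cam, ref.shape)
saliency_seg = cam_full >= 0.5             # threshold value is an assumption
print(iou(saliency_seg, ref), pointing_game_hit(cam_full, ref))
```

In this framing, the mIoU for a pathology is simply the mean of these per-image IoU values over all CXRs in which the pathology is present.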
Under both the overlap and the pointing game evaluation […]

[…] Left, IoU score is 0.257, and the pointing game is a "hit" since Grad-CAM's most activated pixel is inside the reference segmentation. Right, IoU score is 0.071, and the pointing game is a "miss" since Grad-CAM's most activated pixel is outside the reference segmentation.

In order to better understand under what circumstances the AI localization performance was closer to, or further from, the expert localization performance, we first conducted a […] pathological characteristic on the evaluation metric at hand. See Table 2 for coefficients from the regressions under the overlap and hit rate evaluation schemes.

Our qualitative analysis uncovered three patterns in the saliency method segmentations that were associated with lower localization performance. First, we observed that when multiple instances of a single pathology are present in a CXR, instead of highlighting each distinct instance of the pathology separately, the saliency method segmentation often highlights one large confluent area that encompasses all of the instances (see Fig. 3c). Second, we found that saliency method segmentations tend to be significantly larger than either the human benchmark or reference segmentations, and often fail to respect clear anatomical boundaries (see Fig. 3d). Correspondingly, the AI overlap coefficient for area […]

Since the saliency method is highly dependent on the DNN's architecture, we conducted statistical analyses to determine whether there was any correlation between the model's confidence in its prediction and AI localization performance. We first ran a simple regression for each pathology using the model's probability output for the pathology as the single independent variable and using the IoU of the AI segmentation with the reference segmentation as the response variable. We then performed a simple regression that uses the same approach as above, but that includes all 10 pathologies. For each of the 11 regressions, we excluded true negative cases in order to calculate the IoU score for the expert segmentations. In addition to the linear regression coefficients, we also computed the Spearman correlation coefficients to capture any potential non-linear associations (see Table 3). […] We also performed analogous experiments using hit rate as the response variable and found comparable results (see Supplementary Table 1).
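To make the confidence analysis concrete, the sketch below shows one way such a per-pathology regression and Spearman correlation could be computed with scipy; the array names and the synthetic data are placeholders, not the authors' data or released code.

```python
# Minimal sketch (not the authors' code): relating model confidence to Grad-CAM IoU
# for one pathology. `probs` holds the model's probability outputs and `ious` the IoU
# of each Grad-CAM segmentation with the reference segmentation, restricted to CXRs
# where the pathology is present (true negatives excluded).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
probs = rng.uniform(0.1, 0.9, size=200)                        # placeholder confidences
ious = np.clip(0.5 * probs + rng.normal(0, 0.1, 200), 0, 1)    # placeholder IoU scores

# Simple linear regression: IoU ~ model probability output
lin = stats.linregress(probs, ious)
print(f"slope={lin.slope:.3f}, p-value={lin.pvalue:.3g}")

# Spearman rank correlation to capture potential non-linear (monotonic) associations
rho, p = stats.spearmanr(probs, ious)
print(f"Spearman rho={rho:.3f}, p-value={p:.3g}")
```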
The purpose of this work was to evaluate the performance of saliency methods, which are widely used in clinical practice for DNN prediction explainability. We demonstrate that saliency maps are consistently worse than expert radiologists at localizing a variety of pathologies on CXRs. We use qualitative and quantitative analyses to establish that AI localization performance is furthest from expert localization performance for pathologies that have multiple instances, are smaller in size, and have shapes that are more complex, suggesting that deep learning explainability as a clinical interface may be less reliable and less useful when used for pathologies with those characteristics. We show that model assurance is positively correlated with AI localization performance, which could indicate that saliency methods are safer to use as a decision aid to clinicians when the model has made a positive prediction with high confidence. Finally, since IoU computes the overlap of two segmentations but pointing game hit rate better captures diagnostic attention, we suggest using both metrics to evaluate both AI and expert localization performance.

Our work has several potential implications for patient care. Heat maps generated using saliency methods are advocated as clinical decision support in the hope that the heat maps not only improve clinical decision-making, but also encourage clinicians to trust model predictions 32-34. However, we found that AI localization performance, on balance, was worse than expert localization across multiple analyses. This is consistent with […] not only under what clinical conditions saliency methods might be safer to use, but also how we might improve saliency methods in the future. We found that AI localization performance worsens in the presence of pathologies that have multiple instances. We also found that AI localization performance worsens in the presence of pathologies that are smaller in size compared with the CXR image.
This result explains why, under both the overlap and hit rate schemes, the gap between AI and expert localization performance was the largest for Lung Lesion, whose mean area ratio was the smallest of the 10 pathologies we explored. Moreover, AI localization performance under both evaluation schemes was best for Enlarged Cardiomediastinum and Cardiomegaly, which had the largest and third largest area ratios, respectively, of the 10 pathologies, suggesting that saliency methods might be safer to use in the context of these two pathologies, or pathologies with similar characteristics. Grad-CAM segmentations often fail to respect clear anatomical boundaries, and we hypothesize that this is an algorithmic artifact of Grad-CAM, whose feature-map-sized (14 × 14) heat map is interpolated to the original image dimension (usually about 2000 × 2000 pixels), resulting in coarse resolution. We also found that AI localization performance worsens in the presence of pathologies whose shapes are more complex. AI localization for Pneumothorax and Support Devices, both of which were more elongated and complex than any of the other conditions, underperformed compared to expert localization performance; however, this performance gap must also be considered in the context of the model training data prevalence, and future work may explore the impact of training data prevalence on the localization performance of saliency methods.

While IoU is a commonly used metric for evaluating semantic segmentation outputs, there are inherent limitations to the metric in the pathological context. This is indicated by our finding that even the expert segmentations had relatively low overlap with the reference segmentations (the highest expert mIoU was 0.712, for Cardiomegaly). One potential explanation for this consistent underperformance is that pathologies can be hard to distinguish, especially without clinical context. Furthermore, whereas many people might agree on how to segment, say, a cat or a stop sign in traditional computer vision tasks, radiologists use a certain amount of clinical discretion when defining the boundaries of a pathology on a CXR. There can also be institutional and geographic differences in how radiologists are taught to recognize pathologies, and studies have shown that there can be high interobserver variability in the interpretation of CXRs 36-38. We sought to address this with the hit rate evaluation metric, which highlights when two radiologists share the same diagnostic intention, even if it is less exact than IoU in comparing segmentations directly. Expert performance using hit rate was above 0.95 for four pathologies (Pneumothorax, Cardiomegaly, Support Devices, and Enlarged Cardiomediastinum); these are pathologies for which there is often little disagreement between radiologists about where the pathologies are located, even if the expert segmentations are noisy. The only pathology for which AI localization performance was better than expert localization performance under the hit rate scheme was Consolidation.
However, because the hit rate scheme required the benchmark radiologists to select only one point on the CXR, even if there were multiple instances of the pathology present (as is often the case with Consolidation), it is likely that the hit rate setup unfairly penalized expert performance in this case and that it is not the best evaluation metric to use for Consolidation. Further work is needed to validate this hypothesis and to demonstrate which segmentation evaluation metrics, even beyond overlap and hit rate, are more appropriate for which pathologies when evaluating saliency methods for the clinical setting.

This study does not involve human subject participants.

[…] the ratio between the area of overlap and the area of union between the ground-truth and the predicted areas, ranging from 0 to 1, with 0 signifying no overlap and 1 signifying perfectly overlapping segmentations. We then compared the mean Intersection over Union […]

Pathology Characteristics. We used four features to characterize the pathologies. 1. Number of instances is defined as the number of disjoint components in the segmentation. 2. Area ratio is the area of the pathology divided by the total image area. 3. and 4. Elongation and irrectangularity are geometric features that measure shape complexity; they were designed to quantify what radiologists qualitatively describe as focal or diffuse. To calculate these metrics, a rectangle of minimum area enclosing the contour is fitted to each pathology. Elongation is defined as the ratio of the rectangle's longer side to its shorter side. Irrectangularity is defined as 1 − (area of the segmentation / area of the enclosing rectangle), with values ranging from 0 to 1 and 1 being highly irrectangular. When there are multiple instances within one pathology, we used the characteristics of the dominant instance (largest in perimeter).

Model Confidence. We used the probability output of the DNN architecture for model confidence. The probabilities were normalized using min-max normalization per pathology before aggregation.

Linear Regression. For each evaluation scheme (overlap and hit rate), we ran three groups of simple linear regressions, with the expert and AI evaluation metrics and their differences as the response variables. Each group has four regressions, using each of the above four pathological characteristics as the regression's single attribute, respectively; only CXRs with a positive label were included in each regression (n = 1,534). All features were normalized using min-max normalization so that they are comparable in scale. We report the 95% confidence interval and p-value of the regression coefficients.
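As an illustration of how these four pathology characteristics could be computed from a binary segmentation mask, here is a minimal OpenCV-based sketch; the paper does not specify its implementation, so the use of connected components, cv2.minAreaRect, and perimeter-based selection of the dominant instance are assumptions.

```python
# Minimal sketch (assumed implementation) of the four pathology characteristics
# computed from a binary segmentation mask of one pathology.
import cv2
import numpy as np

def pathology_characteristics(mask: np.ndarray) -> dict:
    """mask: 2D array with 1 inside the pathology segmentation, 0 elsewhere."""
    mask = mask.astype(np.uint8)

    # 1. Number of instances: disjoint connected components in the segmentation.
    n_labels, _ = cv2.connectedComponents(mask)
    n_instances = n_labels - 1  # label 0 is the background

    # 2. Area ratio: pathology area divided by total image area.
    area_ratio = float(mask.sum()) / mask.size

    # 3-4. Elongation and irrectangularity of the dominant instance
    # (largest perimeter), via its minimum-area enclosing rectangle.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    dominant = max(contours, key=lambda c: cv2.arcLength(c, closed=True))
    (_, _), (w, h), _ = cv2.minAreaRect(dominant)
    elongation = max(w, h) / max(min(w, h), 1e-8)            # longer side / shorter side
    irrectangularity = 1.0 - cv2.contourArea(dominant) / max(w * h, 1e-8)

    return {"n_instances": n_instances, "area_ratio": area_ratio,
            "elongation": elongation, "irrectangularity": irrectangularity}
```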
[…] contoured. For Support Devices, radiologists were asked to contour any implanted or invasive devices, including pacemakers, PICC/central catheters, chest tubes, endotracheal tubes, feeding tubes, and stents, and to ignore ECG lead wires or external stickers visible in the chest X-ray. Finally, of the 14 observations labeled in the CheXpert dataset, Fracture, Pleural Other, and Pneumonia were not segmented because they either had low prevalence and/or ill-defined boundaries unfit for segmentation.

[…] each of the 14 observations. In case of availability of more than one view, the models output the maximum probability of the observations across the views. Each chest X-ray was resized to 320 × 320 pixels and normalized before it was fed into the network. The DenseNet121 model architecture 46 was used. Cross-entropy loss was used to train the model. The Adam optimizer 47 was used with default β-parameters of β1 = 0.9 and β2 = 0.999, and the learning rate was fixed at 1 × 10⁻⁴ for the duration of training. Batches were sampled using a fixed batch size of 16 images.
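As a minimal sketch of the training configuration just described (not the authors' released code), the following shows how such a DenseNet121 multi-label model could be set up in PyTorch. The pretrained initialization, the ImageNet normalization statistics, the binary cross-entropy formulation of the per-observation loss, and the data loader are all assumptions or placeholders.

```python
# Minimal sketch (assumed implementation) of the described training setup:
# DenseNet121, 14 outputs, 320x320 inputs, cross-entropy (per-observation BCE) loss,
# Adam with betas=(0.9, 0.999), learning rate 1e-4, batch size 16.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

NUM_OBSERVATIONS = 14

model = torchvision.models.densenet121(pretrained=True)  # ImageNet init assumed
model.classifier = nn.Linear(model.classifier.in_features, NUM_OBSERVATIONS)

criterion = nn.BCEWithLogitsLoss()  # per-observation binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Resize and normalization; ImageNet statistics assumed, with grayscale CXRs
# replicated to three channels upstream.
transform = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def train_one_epoch(train_loader):
    """`train_loader` is a placeholder DataLoader yielding (images, labels) batches of 16."""
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels.float())
        loss.backward()
        optimizer.step()

# At inference, probabilities from multiple views of the same study can be reduced
# with an element-wise maximum per observation:
# study_probs = torch.stack([torch.sigmoid(model(v)) for v in views]).amax(dim=0)
```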
Data Availability. CheXpert data is available at https://stanfordmlgroup.github.io/competitions/chexpert/. The validation set and corresponding benchmark radiologist annotations will be made available online for the purpose of extending the study.

Code Availability. All code used to produce the results of the paper will be in a public repository for the purpose of reproducing the study. The link to the code will be added to the text of the paper for the camera-ready version.

References

Interpretability of Artificial Intelligence in Radiology: Challenges and Opportunities.
Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization.
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034.
Towards Trainable Saliency Maps in Medical Imaging.
Quantifying Explainability of Saliency Methods in Deep Neural Networks.
Deep learning predicts hip fracture using confounding patient and healthcare variables.
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.
AI for radiographic COVID-19 detection selects shortcuts over signal.
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.
Visual Explanations from Deep Networks via Gradient-based Localization.
Deep learning approaches to biomedical image segmentation.
Performance of a convolutional neural network derived from an ECG database in recognizing myocardial infarction.
Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network.
Referral for disease-related visual impairment using retinal photograph-based deep learning: a proof-of-concept.
Deep Learning for Predicting Refractive Error From […].
Detection of anaemia from retinal fundus images via deep learning.
Deep Learning to Assess Long-term Mortality From Chest Radiographs.
CheXaid: deep learning assistance for physician diagnosis of tuberculosis using chest x-rays in patients with HIV.
Human-computer collaboration for skin cancer recognition.
AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining.
Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet.
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.
Top-down Neural Attention by Excitation Backprop. arXiv:1608.00507.
Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study.
DLBCL-Morph: Morphological features computed using deep learning for an annotated digital DLBCL image set.
Impact of Deep Learning Assistance on the Histopathologic Review of Lymph Nodes for Metastatic Breast Cancer.
Deep Learning for the Digital Pathologic Diagnosis of […].
Evaluating the Impact of a Web-based Diagnostic Assistant.
Deep Learning-Assisted Diagnosis of Cerebral Aneurysms Using the HeadXNet Model.
Network output visualization to uncover limitations of deep learning detection of pneumothorax. Medical Imaging 2020: Image Perception, Observer Performance, and Technology Assessment 11316, 113160O (International Society for Optics and Photonics).
Interobserver Variability in the Radiographic Diagnosis of Adult Outpatient Pneumonia.
Disagreements in Chest Roentgen Interpretation.
Interobserver Reliability of the Chest Radiograph in […].
Assessing the validity of saliency maps for abnormality localization in medical imaging.
Evaluation and Comparison of CNN Visual Explanations for Histopathology.
Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging.
Axiomatic Attribution for Deep Networks.
Weakly Supervised […].
Densely Connected Convolutional Networks. arXiv:1608.06993.
Adam: A Method for Stochastic Optimization.
A Threshold Selection Method from Gray-Level Histograms.
Competing Interests. There are no competing interests.