authors: Zhu, Hongzhi; Salcudean, Septimiu; Rohling, Robert
title: Gaze-Guided Class Activation Mapping: Leveraging Human Attention for Network Attention in Chest X-rays Classification
date: 2022-02-15

The increased availability and accuracy of eye-gaze tracking technology has sparked attention-related research in psychology, neuroscience, and, more recently, computer vision and artificial intelligence. The attention mechanism in artificial neural networks is known to improve learning tasks. However, no previous research has combined network attention with human attention. This paper describes a gaze-guided class activation mapping (GG-CAM) method to directly regulate the formation of network attention based on expert radiologists' visual attention for the chest X-ray pathology classification problem, which remains challenging due to the complex and often nuanced differences among images. GG-CAM is a lightweight ($3$ additional trainable parameters for regulating the learning process) and generic extension that can be easily applied to most classification convolutional neural networks (CNNs). GG-CAM-modified CNNs do not require human attention as an input when fully trained. Comparative experiments suggest that two standard CNNs with the GG-CAM extension achieve significantly better classification performance. The median area under the curve (AUC) for ResNet50 increases from $0.721$ to $0.776$. For EfficientNetv2 (s), the median AUC increases from $0.723$ to $0.801$. GG-CAM also improves the interpretability of the network, which facilitates weakly-supervised pathology localization and analysis.

Analogies and comparisons are frequently made between artificial neural networks (ANNs) and their biological counterparts. Many explorations and innovations in ANNs have been inspired by concepts from biological neural networks, among which the attention mechanism, popularized by [36], is one of the most prominent research domains [12]. The essence of attention (artificial and biological) is "the flexible control of limited computational resources", yet its precise mechanism remains to be discovered [17, 29]. Though conceptually intertwined, the attention mechanism applied to computer vision tasks using convolutional neural networks (CNNs) does not generally mirror human attention. One of the most frequently implemented attention models in CNNs, multiplicative (or dot-product) attention, enables spatial and/or feature-wise attention via explicitly designed network architectures, e.g., the Squeeze-and-Excitation Network [19] and the Attention Gated Network [31]. These explicitly crafted architectures can regulate the information flow and automatically form attention during the training process, but the resulting attention usually has limited interpretability. Another branch of research, the class activation mapping (CAM) method [42] and its variants [7, 32, 37], aims to unveil the specific regions and/or features in the input images that the network attends to for specific computer vision tasks. The CAM attention methods can be applied to most generic CNN architectures, e.g., ResNet [15] and EfficientNet [34], and the resulting attention is presented as (ideally human-interpretable) 2D attention heat maps.
In this paper, unlike previous attempts, we focus on the central question: can human attention be directly exploited to guide the formation of network attention? To answer this question, we must first quantitatively measure human attention. One candidate for such a measurement is gaze tracking, which records the position of one's eye-gaze as a time series [6, 45]. Due to tracking inaccuracies [43] and the stochastic nature of eye movement [44], the raw tracking data are often too noisy to faithfully reflect human attention. Therefore, visual heat maps, generated by 2D clustering and smoothing of the tracked gaze positions, are more commonly employed to represent the distribution of one's visual attention [16]. Building on such visual heat maps, we propose gaze-guided class activation mapping (GG-CAM), which uses the visual attention heat map to supervise the formation of the CAM attention in generic CNN architectures for image classification. The proposed method, to the best of our knowledge, is the first that combines network attention with human attention. When applied to the disease classification tasks for chest X-ray (CXR) images in a public dataset [22], GG-CAM displays three major advantages:
1. It is a lightweight (only 3 additional trainable parameters) and generic method that can be easily added to most classification CNNs for the integration of human attention into network attention.
2. The classification performance of CNNs with GG-CAM is significantly improved, and the fully trained networks no longer require visual attention as an input.
3. The interpretability of the CNNs is improved, as the network attention can spatially relate pathology to organ location.
The rest of the paper is organized as follows. Section 2 presents the background, reviewing recent advances in disease classification on CXR images with CNNs and deep learning methods that use visual attention, and summarizing the CAM method. In Section 3, we propose the GG-CAM method. Experiments and results are discussed in Section 4. Lastly, we conclude the paper in Section 5.

CXR imaging is one of the most frequently used medical diagnostic tools, capable of identifying multiple pathologies, e.g., pneumonia, pneumothorax, tuberculosis, COVID-19, and cardiomegaly, and supporting triage [5]. There exist many public CXR datasets, such as ChestX-ray14 [38], CheXpert [20], MIMIC-CXR [21] and Ped-Pneumonia [25], that focus on single or multiple pathologies. For datasets focusing on a single pathology, existing CNNs for classification (ResNet [15], MobileNet [18], etc.) can yield outstanding classification performance [26]. For example, on the pediatric pneumonia CXR dataset (Ped-Pneumonia [25]), [14] reports a binary classification accuracy of 98.4%. However, datasets targeting multiple pathologies are more difficult for CNNs to classify [5, 26]. One example is the MIMIC-CXR dataset [21], which contains at least 70 different kinds of pathologies and abnormalities, several of which may coexist in a single CXR image [40]. The state-of-the-art CNN classification results for the MIMIC-CXR dataset [21] are reported in [40] and [39], which share the same CNN architecture but focus on two different classification tasks. In [39], the focus is on the binary classification task (normal versus pathological/abnormal CXR images), and the reported AUC is 0.824.
In [40], 70 pathologies and abnormalities are considered for classification, and the AUC for each class ranges from 0.628 to 0.985. Other than the challenges arising from the complexity of CXR images, the fact that most CXR datasets are imbalanced adds an additional level of difficulty for network training and fair benchmarking [26]. To address this problem, a balanced dataset derived from the MIMIC-CXR dataset [21] was proposed in [22]. Images in the balanced dataset can be classified into three mutually exclusive categories: normal, pneumonia, and cardiomegaly (enlarged heart). It is also a modality-rich dataset that can facilitate cross-modal learning and multi-task learning research.

Visual heat maps, as a measurement of one's visual attention, are employed to assist CNN training and optimization through two primary approaches. The most straightforward approach is to feed visual heat maps into the CNN together with the images, as in [3, 4, 30, 33], because gaze patterns are task-specific [23]. By doing so, the network can further process the information conveyed in the visual heat maps to enhance performance; however, this adds an input to the network, making it harder to deploy in real-world tasks. The other approach avoids the dependency on visual heat maps through representation learning, by using the visual heat maps to facilitate the learning of representative and robust features through transfer learning [8, 9, 41] or multi-task learning [2]. Still, transfer learning methods may suffer from negative transfer or overfitting, and multi-task learning methods commonly introduce a large number of additional parameters to the network. Although existing CNN methods with visual heat maps yield improved performance, the networks are designed to process the visual heat maps in the same manner as other information sources, leaving the embedded attention characteristics underexplored.

Most classification CNNs are built from the same consecutive computational blocks (shown in Figure 1a): a feature extractor (also called the backbone network), a global average pooling layer, a linear layer (also called a dense or fully connected layer), and a softmax layer (or another normalization operation). To demystify the internal mechanisms of these CNN "black boxes", the CAM method [42] was proposed. In [42], the authors focused on the global average pooling layer and the linear layer of the CNN, from which they extracted 2D attention maps depicting the regions of the input that the network attends to for its outputs. The detailed CAM method is explained next.

Let A ∈ R^{G×H×W} be the output tensor from the feature extractor, where G is the number of features and H and W are the spatial dimensions; let p ∈ R^{G} be the output tensor from the global average pooling layer; and let y ∈ R^{C} be the output tensor from the linear layer, where C is the number of classes that the network categorizes. Without loss of generality, we neglect the batch dimension for tensors in this paper. Therefore, we have:

$$ y = \Lambda p + \lambda, \qquad (1) $$

where Λ ∈ R^{C×G} and λ ∈ R^{C} are the trainable weights and biases, respectively, of the linear layer. Since p is the global average pooling of A, i.e., $p_k = \frac{1}{HW}\sum_{i,j} A_{k,i,j}$, the c-th element of the vector y, y_c, is related to A through the following equation:

$$ y_c = \sum_{k=1}^{G} \Lambda_{c,k} \, \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} A_{k,i,j} + \lambda_c, \qquad (2) $$

where A_{k,i,j} denotes the element positioned at (k, i, j) in A. By altering the order of summation in Equation (2), we have:

$$ y_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{G} \Lambda_{c,k} A_{k,i,j} + \lambda_c. \qquad (3) $$

The CAM for the network, Ω ∈ R^{C×H×W}, is defined as:

$$ \Omega_{c,i,j} = \sum_{k=1}^{G} \Lambda_{c,k} A_{k,i,j}, \qquad (4) $$

which is the innermost summation in Equation (3). From Equation (4), we know that Ω has three dimensions.
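To make Equations (1) through (4) concrete, the following minimal PyTorch sketch computes Ω from a backbone output A and the linear-layer parameters. The tensor sizes and variable names are illustrative assumptions, not the authors' released code.

```python
import torch

G, H, W, C = 2048, 16, 16, 3              # example feature, spatial, and class sizes (assumed)
A = torch.randn(G, H, W)                  # backbone output A ∈ R^{G×H×W}
Lambda = torch.randn(C, G)                # linear-layer weights Λ ∈ R^{C×G}
lam = torch.randn(C)                      # linear-layer biases λ ∈ R^{C}

p = A.mean(dim=(1, 2))                    # global average pooling: p ∈ R^{G}
y = Lambda @ p + lam                      # linear layer: class scores y ∈ R^{C}   (Eq. 1)

# Eq. (4): Ω_{c,i,j} = Σ_k Λ_{c,k} · A_{k,i,j}
Omega = torch.einsum('cg,ghw->chw', Lambda, A)   # Ω ∈ R^{C×H×W}

# Sanity check: y_c equals the spatial mean of Ω^c plus λ_c,
# i.e., rearranging the pooling and the linear map leaves the class scores unchanged.
y_from_cam = Omega.mean(dim=(1, 2)) + lam
print(torch.allclose(y, y_from_cam, atol=1e-4))  # True
```

The final check illustrates why a CAM-style rearrangement of the classification head does not change the class scores, which is the property the CAM layer described next relies on.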
Let Ω^c ∈ R^{H×W} be the c-th slice of Ω along the first dimension; Ω^c is a 2D attention heat map that explains which regions of the input contribute to the decision of predicting class c as the CNN output. This can be better explained by combining Equations (3) and (4):

$$ y_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \Omega_{c,i,j} + \lambda_c = \bar{\Omega}^{c} + \lambda_c, \qquad (5) $$

where $\bar{\Omega}^{c}$ is the mean of all elements in Ω^c. From (5) we can see that the value of y_c can only be increased if elements of Ω^c are increased. Ω has been frequently applied for network explanation and weakly-supervised localization [1, 27].

The proposed GG-CAM method has four major components. First, we propose a novel layer, termed the CAM layer, to substitute for the standard classification head (the global average pooling layer and the linear layer) in a classification CNN. Second, we describe the method to generate a visual heat map from raw gaze coordinates. Third, we introduce a novel loss function for supervising network attention with human attention. The last component is the multi-task training method that we adopt to balance the attention-supervision and classification tasks.

From Equations (1) through (4), we know that Ω is embedded in the network even though it is not explicitly computed. To better utilize Ω, we propose the CAM layer, which computes Ω explicitly in the network. The CAM layer is created to replace the standard classification head, i.e., the global average pooling layer and the linear layer, in a generic classification CNN as shown in Figure 1a. The CAM layer has the same number of trainable parameters, Λ and λ, as a linear layer. Mathematically, the CAM layer takes A as input, uses Equation (4) to compute Ω, and then applies Equation (5) to Ω to obtain y. Therefore, in principle, the only difference between using the standard classification head and using the CAM layer as the classification head is a rearrangement of interchangeable mathematical operations. By doing so, as depicted in Figure 1b, there are two possible outputs from the CAM layer, y and Ω, which facilitates the integration of human attention into the network.

Usually, to train a classification CNN, the cross-entropy loss L_ce is used:

$$ L_{ce} = -\ln \hat{y}_{Y} = -\ln \frac{\exp(y_{Y})}{\sum_{c=1}^{C} \exp(y_{c})}, \qquad (7) $$

where Y ∈ {1, 2, ..., C} is the true class label and ŷ = Softmax(y). Substituting (5) into (7) yields:

$$ L_{ce} = -\ln \frac{\exp(\bar{\Omega}^{Y} + \lambda_{Y})}{\sum_{c=1}^{C} \exp(\bar{\Omega}^{c} + \lambda_{c})}. \qquad (8) $$

From (8), we see that the cross-entropy loss only depends on the mean value of each Ω^i, such that different Ω can result in the same L_ce as long as the mean value of each Ω^i, i ∈ {1, 2, ..., C}, is unchanged. This opens the way for us to regulate Ω without jeopardizing the network's performance.

To regulate network attention with human attention, we propose the selective mean square error (MSE) loss L_sm, which supervises Ω with the visual heat map Ψ during training:

$$ L_{sm} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( \tilde{\Omega}^{Y}_{i,j} - \Psi_{i,j} \right)^{2}, \qquad (9) $$

where Y ∈ {1, 2, ..., C} is the true class label for input I, and Ψ is the visual attention heat map for I (matched to the spatial dimensions of Ω). $\tilde{\Omega}^{c}$ is related to Ω^c via the following equation:

$$ \tilde{\Omega}^{c}_{i,j} = \mathrm{Sigmoid}\left( \alpha \, \Omega^{c}_{i,j} \right), \qquad (10) $$

where c = 1, 2, ..., C and α > 0 is a trainable scalar. From Equation (9), we can see that for an input I with label Y, only Ω^Y is supervised, and the other Ω^i, i ≠ Y, are neglected. This is based on the fact that one's gaze pattern is task-specific, as described in Section 2.2: we expect the biological and artificial attention for a heart condition to differ from that for a lung infection. Therefore, the visual attention for input I with label Y should only be used to supervise Ω^Y. Equation (10) is applied to the elements of Ω for two main reasons. The first reason lies in the boundlessness of Ω: its elements can take any value.
Figure 2 shows the distribution of the elements of Ω for two fully-trained CNNs, ResNet and EfficientNetv2, on the CXR classification task. The distribution functions are shown in log scale. We can see that both distributions are heavy-tailed, with median, mode, and mean values around 0. However, the elements of the visual heat maps, according to Algorithm 1, are bounded between 0 and 1. Hence, the Sigmoid function is used to scale Ω. Additionally, we introduce the trainable parameter α in Equation (10) to allow more flexibility in the supervision.

In the previous section, in addition to the classification loss L_ce, we introduced L_sm for attention supervision. With two losses for a single CNN, a multi-task learning method should be adopted to balance the effect of each loss. In this paper, we use the following equation to perform multi-task learning [28]:

$$ L = \frac{1}{2\sigma_{sm}^{2}} L_{sm} + \frac{1}{2\sigma_{ce}^{2}} L_{ce} + \ln(\sigma_{sm}+1) + \ln(\sigma_{ce}+1), \qquad (11) $$

where σ_sm > 0 and σ_ce > 0 are trainable parameters that dynamically weigh the two losses L_sm and L_ce, and ln(σ_sm + 1) and ln(σ_ce + 1) penalize large values of σ_sm and σ_ce, respectively. Equation (11) was derived from [24], where σ_sm and σ_ce measure the uncertainties of the corresponding loss terms. The advantage of Equation (11) compared to the loss used in [24] is that the loss function is non-negative and cannot diverge to minus infinity during training (see the code sketch below). Additionally, experimental results suggest that the initialization of the parameters σ_sm and σ_ce affects the performance of the network. As classification is the primary task for the network, L_ce should be emphasised, especially towards the final phase of the training process. Therefore, we initialize σ_sm with a small value close to 0 and set σ_ce = 1. A small initial σ_sm forces the network to focus more on reducing L_sm at the start of training. Gradually, σ_sm increases, and the network no longer places its primary focus on L_sm. In this way, L_sm and L_ce have similar contributions to L, especially towards the end of the training process.

To validate our method, we apply GG-CAM to two standard classification CNNs: ResNet50 [15] and EfficientNetv2 (s) [35]. To present the comparative analysis, we first introduce the dataset used in this research, as well as the settings of the training process. Then, we present quantitative results on the networks' classification performance and interpretability. Lastly, limitations and future improvements are discussed.

We use the multi-modal CXR dataset from [22] (available at PhysioNet [11]). It contains 1083 CXR images originating from the MIMIC-CXR dataset [21]. Accompanying each image is a label (normal, cardiomegaly, or pneumonia), segmentations (mediastinum, aortic knob, left and right lungs), a radiology report (text and audio), and tracked eye-gaze coordinates from an expert radiologist examining the image. The GP3 gaze tracker (by Gazepoint), with a 60 Hz sampling frequency, was used, and the tracked gaze positions have an accuracy of around 1° of visual angle (approximately 100-400 pixels on the image, depending on the screen size, eye-to-screen distance, and calibration accuracy). It is a balanced dataset with an even share for each label. For training, validation, and testing, we use 70%, 10%, and 20% of the dataset, respectively. Each sub-dataset is randomly generated and preserves the label balance.
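The code below is a minimal PyTorch sketch of the training objective described above: the selective MSE loss with the sigmoid scaling of Equations (9) and (10), combined with the cross-entropy loss through the uncertainty-based weighting of Equation (11). The module name, the batched tensor shapes, and the 1/(2σ²) weighting form (taken from [24], as reconstructed above) are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GGCAMLoss(nn.Module):
    """Sketch of the GG-CAM objective: selective MSE (Eqs. 9-10) plus
    uncertainty-weighted combination with cross-entropy (Eq. 11).
    Holds the 3 extra trainable parameters mentioned in the paper: alpha, sigma_sm, sigma_ce."""

    def __init__(self, sigma_sm_init: float = 0.01, sigma_ce_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))          # α > 0 in Eq. (10)
        self.sigma_sm = nn.Parameter(torch.tensor(sigma_sm_init))
        self.sigma_ce = nn.Parameter(torch.tensor(sigma_ce_init))

    def forward(self, omega, scores, labels, psi):
        # omega:  (B, C, H, W) CAM tensor from the CAM layer
        # scores: (B, C) class scores y
        # labels: (B,) integer ground-truth labels Y
        # psi:    (B, H, W) visual heat maps in [0, 1], resampled to (H, W)
        l_ce = F.cross_entropy(scores, labels)                            # Eq. (7)
        omega_y = omega[torch.arange(omega.size(0)), labels]              # select Ω^Y only (Eq. 9)
        omega_tilde = torch.sigmoid(self.alpha * omega_y)                 # Eq. (10)
        l_sm = F.mse_loss(omega_tilde, psi)                               # Eq. (9)
        s_sm = self.sigma_sm.clamp_min(1e-6)                              # keep sigmas positive
        s_ce = self.sigma_ce.clamp_min(1e-6)
        # Eq. (11), assuming the 1/(2σ²) weighting inherited from [24]
        return (l_sm / (2 * s_sm ** 2) + l_ce / (2 * s_ce ** 2)
                + torch.log1p(s_sm) + torch.log1p(s_ce))
```

Selecting only Ω^Y before the MSE implements the task-specific-gaze assumption: the expert's heat map for an image is compared only against the attention map of that image's true class, while the other class maps are left free.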
The raw CXR images in the dataset follow the Digital Imaging and Communications in Medicine (DICOM) standard and have very high resolution, e.g., 2544 × 3056 pixels. Before feeding the images to the CNNs, we downsample them so that their heights and widths are halved. To reduce the high dynamic range of the DICOM images, we normalize all images to the range [0, 1].

We use the PyTorch framework for the implementation, training, and testing of the CNNs. During training, we implement the "reduce learning rate on plateau" mechanism, such that the learning rate is reduced to 10% of its current value once the validation loss has not decreased for P consecutive epochs. For all networks, we optimize the following hyper-parameters: the learning rate, the optimizer, and P. The optimizers we consider are Adam, Adamax, stochastic gradient descent (SGD), and SGD with momentum, with their default settings. For GG-CAM modified networks, we also optimize the initialization of σ_sm and the Gaussian blur parameter B. Experimental results indicate that the networks' performance is not sensitive to B when B is in the range [200, 1000]; the learning rate and the initialization of σ_sm play more important roles in the learning process.

We use "EffNet" and "ResNet" to abbreviate EfficientNetv2 (s) and ResNet50, respectively, and append "+GG-CAM" to a network's name to differentiate GG-CAM modified CNNs from standard CNNs, e.g., EffNet versus EffNet+GG-CAM. Due to the randomness in the networks' performance even with identical hyper-parameters, all results for CNNs reported in this paper are based on 5 independent training runs with the same hyper-parameters. In Table 1, we summarize the optimized hyper-parameters for each CNN. Adam is the best optimizer for all CNNs and is thus not included in the table.

To evaluate the networks' classification performance, we use the multi-class area under the curve (AUC) metric [13]. Comparative results are shown in Figures 3a and 3b. We use the ANOVA test to compare networks, where p < 0.05 indicates that the difference is statistically significant. Therefore, we conclude that GG-CAM modified CNNs improve the classification significantly compared to the corresponding standard CNNs. More specifically, the median AUC for EffNet+GG-CAM increases from 0.723 to 0.801, and the median AUC for ResNet+GG-CAM increases from 0.721 to 0.776. We focus on median statistics in this paper to avoid the influence of outliers, which occur occasionally (in less than 5% of runs) when the networks fail to converge.

In Table 2, we report more detailed classification performance for each label. We can see that pneumonia is more difficult to classify than the other labels. The performance of EfficientNetv2 is superior to ResNet, and the GG-CAM modification generally improves the network's performance. We also tested the performance of the network from [40] and [39], termed PNet, on our dataset. The results suggest that the AUC metrics for PNet are statistically equivalent to those of ResNet and EffNet according to the ANOVA test (p > 0.05). Given that the number of trainable parameters in PNet is at least twice as large as in ResNet or EffNet, and that PNet is a customized hybrid model that also uses ResNet in its structure, we did not further apply the GG-CAM modification to PNet for analysis.

We know that the pathologies we try to classify are organ-specific: cardiomegaly occurs only in the heart, and pneumonia is a lung infection.
Therefore, an interpretable classifier, given an abnormal CXR image, would attend to the corresponding organ/anatomical areas in the image. Based on this rationale, we can apply the evaluation method of [32], in which the authors calculate the percentage of CAM attention heat maps whose peak lies within the region of the corresponding classification objective. A larger percentage indicates that the network is more likely to place its attention on the legitimate areas of the input image, and is thus more interpretable. Mathematically, in our dataset, for an input I with true label Y (cardiomegaly or pneumonia), we declare the network's attention interpretable if the peak of Ω^Y lies within S_2 for Y = 2 (cardiomegaly) or within S_3 for Y = 3 (pneumonia), where S_2 is the set of image coordinates of the heart segmentation and S_3 is the set of image coordinates of the lung segmentation. Note that the spatial dimensions of Ω^Y differ from those of the input I, so coordinates are rescaled before the comparison (a minimal code sketch of this check is given at the end of this section).

Table 2: Detailed metrics for each class and CNN. Values are presented as median statistics followed by the standard deviation after the ± sign. The best metrics are highlighted in bold font.

The variance of the interpretability is large for all CNNs. A statistically significant improvement in interpretability with the GG-CAM modification occurs only for ResNet, not for EffNet. For pneumonia, the GG-CAM modification not only improves the interpretability but also reduces the variance of the interpretability distribution.

To visualize network attention and visual attention, in Table 3 we present Ω^Y and Ψ for each CNN on the testing images. By observing the figures, we can see that the network attention differs from the human visual attention. Concentrated red areas of the visual heat map indicate that the expert's attention is specific to the targeted regions, which are usually the abnormal areas, whereas the network attention is much less concentrated and covers a broader area. Despite the differences, patterns in the attention of the GG-CAM modified CNNs can be observed. For Y = 1 (normal CXR images), the attention of EffNet+GG-CAM and ResNet+GG-CAM is allocated to the entire heart and lung regions of the image, which can be interpreted as the network checking whether any abnormality is present in these regions. For Y = 2 (cardiomegaly CXR images), the heart and neighbouring regions are attended to. For Y = 3 (pneumonia CXR images), the network attention approximately follows the visual attention, highlighting potentially infected lung sections. In contrast, the attention of EffNet (standard) and ResNet (standard) is more random and less interpretable.

Although the GG-CAM modified CNNs usually produce better classification results, disadvantages and limitations awaiting future improvements still exist. The first is the extended training time: experimental results show that a GG-CAM modified CNN often requires more epochs (200 to 300) to converge compared to the standard CNN (usually fewer than 150 epochs). Secondly, GG-CAM is only applicable to networks with the architecture shown in Figure 1b. As human attention is not limited to classification, it may be worthwhile to extend the GG-CAM method to a broader range of networks and/or tasks. Finally, we have only validated our method on a single radiology task, which follows well-established examination standards and protocols. However, as human attention exhibits diversified patterns across individuals, it is worth exploring how our method works when the task does not have structured standards and protocols.
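As referenced earlier in this section, the interpretability criterion declares a network's attention interpretable when the peak of Ω^Y falls inside the segmentation of the organ associated with label Y. The following minimal sketch performs that check; the function name and the assumption that the organ mask has already been resampled to Ω's spatial resolution are ours, not the paper's.

```python
import torch


def attention_is_interpretable(omega_y: torch.Tensor, organ_mask: torch.Tensor) -> bool:
    """Return True if the peak of the class-Y CAM lies inside the organ mask.

    omega_y:    (H, W) attention map Ω^Y for the true label Y
    organ_mask: (H, W) binary mask (heart for cardiomegaly, lungs for pneumonia),
                assumed to be resampled to Ω's resolution beforehand.
    """
    flat_idx = torch.argmax(omega_y)                      # index of the attention peak
    i, j = divmod(flat_idx.item(), omega_y.size(1))       # convert to (row, col)
    return bool(organ_mask[i, j] > 0)                     # peak inside the segmentation?
```

Averaging this flag over a test set would give the percentage-based interpretability metric described above.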
In this paper, we showed that human attention can be used to directly regulate network attention, which not only boosts the network's performance but also enhances its interpretability. With GG-CAM, most classification CNNs can be modified to gain the capability of augmenting their own attention with human attention. Therefore, by recording gaze behavior, we can develop better methods to alleviate the workload of medical practitioners and specialists performing time-consuming and labour-intensive CXR annotation tasks. While we are still at an early stage of understanding artificial and biological neural networks, we believe our results can provide insights for future interdisciplinary research and applications in attention and computer vision.

References

[1] Rethinking class activation mapping for weakly supervised object localization.
[2] Gaze-informed multi-objective imitation learning from human demonstrations.
[3] Multi-task SonoEyeNet: Detection of fetal standardized planes assisted by generated sonographer attention maps.
[4] SonoEyeNet: Standardized fetal ultrasound plane detection informed by eye tracking.
[5] Deep learning for chest X-ray analysis: A survey.
[6] Evaluation of gaze tracking calibration for longitudinal biomedical imaging studies.
[7] Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks.
[8] Ultrasound image representation learning by modeling sonographer visual attention.
[9] Discovering salient anatomical landmarks by predicting human gaze.
[10] An efficient algorithm for Gaussian blur using finite-state machines.
[11] PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation.
[12] A survey on visual transformer.
[13] A simple generalisation of the area under the ROC curve for multiple class classification problems.
[14] Efficient pneumonia detection in chest X-ray images using deep transfer learning.
[15] Deep residual learning for image recognition.
[16] Eye tracking: A comprehensive guide to methods and measures.
[17] No one knows what attention is. Attention, Perception, & Psychophysics.
[18] MobileNets: Efficient convolutional neural networks for mobile vision applications.
[19] Squeeze-and-excitation networks.
[20] CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison.
[21] MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.
[22] Creation and validation of a chest X-ray dataset with eye-tracking and report dictation for AI development.
[23] Gaze embeddings for zero-shot image classification.
[24] Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.
[25] Large dataset of labeled optical coherence tomography (OCT) and chest X-ray images.
[26] Intelligent pneumonia identification from chest X-rays: A systematic literature review.
[27] Weakly-supervised self-training for breast cancer localization.
[28] Auxiliary tasks in multi-task learning.
[29] Attention in psychology, neuroscience, and machine learning.
[30] Using eye gaze to enhance generalization of imitation networks to unseen environments.
[31] Attention gated networks: Learning to leverage salient regions in medical images.
[32] Grad-CAM: Visual explanations from deep networks via gradient-based localization.
[33] Multi-modal learning from video, eye tracking, and pupillometry for operator skill characterization in clinical fetal ultrasound.
[34] EfficientNet: Rethinking model scaling for convolutional neural networks.
[35] EfficientNetV2: Smaller models and faster training.
[36] Attention is all you need.
[37] Score-CAM: Score-weighted visual explanations for convolutional neural networks.
[38] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.
[39] A robust network architecture to detect normal chest X-ray radiographs.
[40] Comparison of chest radiograph interpretations by artificial intelligence algorithm vs radiology residents.
[41] Human gaze assisted artificial intelligence: A review.
[42] Learning deep features for discriminative localization.
[43] Hand-eye coordination-based implicit re-calibration method for gaze tracking on ultrasound machines: A statistical approach.
[44] The Neyman-Pearson detection of microsaccades with maximum likelihood estimation of parameters.
[45] A novel gaze-supported multimodal human-computer interaction for ultrasound machines.