LAP: An Attention-Based Module for Faithful Interpretation and Knowledge Injection in Convolutional Neural Networks
Rassa Ghavami Modegh, Ahmad Salimi, Hamid R. Rabiee
January 27, 2022

Abstract—Despite the state-of-the-art performance of deep convolutional neural networks, they are susceptible to bias and malfunction in unseen situations. The complex computation behind their reasoning is not sufficiently human-understandable to develop trust. External explainer methods have tried to interpret network decisions in a human-understandable way, but they are accused of fallacies due to their assumptions and simplifications. On the other hand, the inherent self-interpretability of models, while being more robust to the mentioned fallacies, cannot be applied to already trained models. In this work, we propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that achieves self-interpretability and the possibility of knowledge injection while improving the model's performance. Moreover, several weakly supervised knowledge injection methodologies are provided to enhance the training process. We verified our claims by evaluating several LAP-extended models on three different datasets, including ImageNet. The proposed framework offers more valid human-understandable and more faithful-to-the-model interpretations than the commonly used white-box explainer methods.

I. INTRODUCTION

As Artificial Intelligence (AI) has proved its efficiency and superior-to-human performance in many fields, its applications have expanded. Nowadays, AI has entered real-life applications like clinical computer-aided decision systems, medical diagnosis, and autonomous car driving. These critical applications raise the question of whether AI models are trustworthy and whether their decisions are valid. Deep Neural Networks (DNNs), as one of the most successful AI models, make their decisions by complex computations which are not understandable by humans. They are trained end-to-end and are susceptible to learning shortcuts and biases of the dataset rather than the actual concepts and reasons. Since AI has become responsible for making decisions in areas interfering with human rights and ethics, governments have started to make laws about its usage. For example, the European Union has adopted new regulations which enable users to demand an explanation of an algorithmic decision that has affected them [13]. This has strengthened the urge for DNNs to explain themselves. Explaining DNNs has other virtues besides verification of decisions, bias detection, developing trust, and compliance with legislation [5]; it can help in diagnosing the model. Also, knowledge can be discovered from models with superior-to-human performance to enrich human knowledge [9].

In recent years, there have been many attempts to explain and interpret DNNs' decisions. These studies can be divided into two general areas of intrinsic and post-hoc methods. Intrinsic interpretability is achieved by enforcing interpretability into the model's architecture [31], [40], [43] and the training strategy [8], [10], [17], [43]. In this approach, the model itself can provide explanations for its decisions. These methods do not apply to already trained models [12], [16]. They generally impose limitations on the model's architecture, and some may sacrifice performance to achieve interpretability [12].
Post-hoc methods try to provide explanations for already trained models. They adopt assumptions and simplifications in their computations that may sacrifice fidelity to achieve more human-understandable explanations [12]. These assumptions may also lead to false interpretations, as they may not be valid in the model's decision-making process [35].

Attention-based architectures are a type of intrinsically interpretable architecture. Attention was first used in Natural Language Processing (NLP) to enable a word to be affected by any of the sentence's words without being restricted by their distance [3]. It provides a score over the words to highlight which words are more important in deciding about another word. Later, its applications extended to the field of vision to mimic human attention in tasks such as visual question answering [22] and image captioning [42]. In general, attention provides an importance score over all input tokens.

In this work, we utilize the attention mechanism to propose a new pooling layer, easily pluggable into any convolutional neural network (CNN). Our main contributions can be summarized as follows:

• Introducing a new module, easily pluggable into any CNN (including already trained networks), that equips the model with self-interpretability and the possibility of knowledge injection without restricting the architecture, losing performance, adding parameters to the main stream of information flow, or increasing the order of computations
• Proposing a concept-wise attention mechanism that assigns attention scores to predefined domain concepts to distinguish the importance of each pixel according to each concept (Fig. 1)
• Proposing a weakly supervised method for injecting knowledge into the model, useful in shaping the decision-making process of the model and leading to more interpretability

II. RELATED WORKS

A. Importance based pooling

Pooling layers and strided convolutions are widely used in CNNs to increase the receptive field and decrease memory consumption. Gao et al. have proposed a unified framework for formulating different pooling strategies, called Local Aggregation and Normalization (LAN), and a pooling method called Local Importance-based Pooling (LIP) [11]. This framework aggregates features within local sliding windows by weighted averaging. The weights are assigned based on the importance of the features. According to this framework, average pooling assumes the same importance score for all the pixels and is susceptible to feature fading. Max pooling assigns one to the highest feature and zero to all the others, leading to very sparse gradient paths and slow training. Strided convolutions assign importance based on the pixel's location in the window and are more sensitive to shifts. LIP uses the attention mechanism to assign importance weights to the features. LIP is applied feature-wise, which makes it different from our proposed architecture. All of the mentioned pooling layers except strided convolutions can lead to loss of relative spatial relations, as they select different features from different spatial locations [27].

In recent years, many studies have been published about the interpretability and explainability of DNNs. Some of the proposed methods are model-agnostic and treat models as black boxes.
One group of model-agnostic methods mimics the operations of the black box by training a white-box model and interpreting it instead [38], which is susceptible to errors, as the mimicking model does not perform exactly as the primary model. Another group of model-agnostic methods, like LIME [29], assesses feature sensitivity by perturbing the feature space around each input, which demands an optimization per sample and is computationally inefficient when applied to many samples with large feature spaces like images.

Model-specific methods work on specific white-box models, meaning they use the architecture and parameters of the model to provide explanations. CAM [45] is only applicable to networks having one final fully connected layer after the last convolutional layer. It uses the weights of the features in the fully connected layer to find the importance of the features for each class and then calculates pixel scores based on their channel-wise activations and the channel importance. Gradient-based methods use gradients of the class scores w.r.t. the input to find the most sensitive features, as if the network were estimated with a first-order Taylor series. The gradients are the weights of the features indicating their importance. Vanilla gradient [34] uses the pure gradients to find important features, producing a noisy importance map. Guided Backpropagation [36] filters negative gradient flows to highlight the effective pixels; the filtering may lead to false positives. Grad-CAM [32] follows the same steps as CAM, but it uses gradients instead to find the importance of channels, which makes it applicable to a broader group of networks. Guided Grad-CAM [32] is a multiplication of Guided Backpropagation and Grad-CAM to produce fine-grained importance maps. The gradient-based methods suffer from gradient saturation problems, leading to near-zero importance scores. Score-based methods like Layer-wise Relevance Propagation (LRP) [4] and Deep Lift [33] propagate scores instead of gradients to calculate the importance of neurons. LRP has defined layer-specific rules to divide the relevance score of the neurons in each layer among their input neurons. The rules assign relevances based on the contribution of the input neurons to the neurons in the next layer. Deep Lift uses a similar procedure to LRP, but it divides the scores based on the difference of the output compared to a baseline input. However, it is not easy to define a suitable and meaningful baseline for all applications. Recently, some works have combined the CAM method with score-based interpretations to improve both. Score-CAM [39] adopts a Deep Lift-style scoring scheme to find the channel-wise increase of confidence and then uses the confidence scores in the CAM method. Similar to Score-CAM, Relevance-CAM [20] uses the LRP method to find the importance of channels in any layer and then applies the CAM method to find the corresponding regions in the input. Relevance-CAM has proved its superior performance to other CAM-based methods.

Attention maps generated by attention mechanisms can generally be used to explain the model's behavior. This kind of explanation is widely used for Vision Transformers [19]. Yang et al. used Transformer raw attention maps to explain their human pose estimation method [41]. Abnar et al. proposed two methods to aggregate the Transformer's attention maps to produce attention flow and attention rollout explanation maps [2]. Chefer et al.
proposed an explainability method in which relevancy is assigned to attention maps and then propagated throughout all blocks [7]. Mondal et al. used this relevancy propagation method to make their COVID-19 screening system explainable [24]. Recently, Chefer et al. proposed another explanation method based on relevancy propagation [6]. In contrast to their previous work, which was only applicable to Transformer encoders, this method is applicable to all Transformer architectures, such as generative models. For CNNs, Zhang et al. [44] and Gu et al. [14] used self-attention layers inside CNN models to explain their medical image analysis systems. These attention-based explanation methods are only applicable either to the specific architectures they proposed or to Transformers. Therefore, they are different from our work, which is applicable to any CNN architecture.

In general, the attention mechanism conveys information about the most important parts of the input data. We adopt the attention mechanism in the reduction process of the pooling layers. The process is depicted in Fig. 2a. In contrast to LIP, attention is not applied in a channel-wise manner. Instead, the whole feature map is passed to a scoring module to calculate pixel-wise importance scores. Then, the scores related to the pixels under the kernel are normalized for each kernel position. The final feature vector is obtained by weighted averaging of the feature vectors of the pixels. This process mimics a zooming action. Instead of mixing the features of the pixels under a pooling kernel, the network dynamically detects the most important pixels and passes their features instead. In this way, small yet important details are not faded or lost but propagated through the network's depth. It also prevents fallacies produced by feature mixing. Different zooming locations under each kernel position may also help in making the model more robust toward shifts and scales.

We have modified the LAN framework to express the pooling procedure of LAP in Eq. (1). In this equation, the output O_{i,j} corresponding to a sliding kernel of size W_K × H_K with the top-left corner at (i, j) is calculated based on the input feature map X related to the pixels under the kernel, X_{i:i+H_K, j:j+W_K}, and the weighting function F. In this equation, ⊙ stands for element-wise multiplication. The problem with LAN was its calculation of importance weights based on single values, while LAP uses the whole feature map under a kernel to calculate the weights. In this work, F combines two steps: the pixel-wise scoring, S, and the local kernel-wise normalization, N. These steps are presented in Fig. 2a. The general form of F can be illustrated as Eq. (2), in which S and N can be any functions.
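To make the formulation concrete, the following is a minimal PyTorch sketch of the aggregation in Eqs. (1) and (2) for a square kernel: an arbitrary scoring function S produces a pixel-wise score map, an arbitrary local normalization N converts the scores under each kernel position into weights, and the output is the weighted average of the features under that position. The function names and shapes are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def lap_pool(x, scores, normalize, kernel_size=2, stride=2, eps=1e-6):
    # x:         (B, C, H, W) input feature map X
    # scores:    (B, 1, H, W) pixel-wise importance scores S(X)
    # normalize: local kernel-wise normalization N, mapping (B, 1, k*k, L)
    #            score windows to weights of the same shape
    B, C, H, W = x.shape
    k = kernel_size
    # gather the pixels under each kernel position
    x_win = F.unfold(x, k, stride=stride).view(B, C, k * k, -1)
    s_win = F.unfold(scores, k, stride=stride).view(B, 1, k * k, -1)
    w = normalize(s_win)                                   # N(S(X)), Eq. (2)
    out = (w * x_win).sum(dim=2) / (w.sum(dim=2) + eps)    # weighted average, Eq. (1)
    H_out = (H - k) // stride + 1
    W_out = (W - k) // stride + 1
    return out.view(B, C, H_out, W_out)

For instance, passing normalize=lambda s: torch.softmax(s, dim=2) gives a softmax-style local normalization; the concept-wise scoring and the Gaussian normalization adopted in this work are described next.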
Interpretability is one of the well-known benefits of the attention mechanism. Attention scores identify the relative importance of each part of the input. Although any arbitrary scoring function can be used for this purpose, relative importance cannot be interpreted in an absolute way. To this end, we designed a pixel-wise scoring module S as presented in Fig. 2b. In general, deep models should learn one or more problem-specific concepts to do their tasks, and this architecture helps make the concepts distinguishable by the LAPs in the network. In this design, we have considered h concept scoring heads, each responsible for assigning an importance score to the pixels for its corresponding concept. The sigmoid function is applied over the scores, providing concept-wise importance probabilities, I_Concept. The importance probabilities are then aggregated to calculate the final score. Therefore, S can be illustrated as Eq. (3), in which σ is the sigmoid function, S_C is the concept pixel-wise scoring function, and A is the aggregation function. These concepts may vary based on the domain; therefore, they can be defined by experts. The functions S_C and A are also defined based on the concepts; S_C is a trainable module, but A can be either a trainable or a fixed aggregation function such as the maximum.

One can adopt any arbitrary method, such as softmax, to normalize the importance scores locally based on the scores of the pixels under the kernel at each position. We have adopted a normalization function in which the importance probabilities are multiplied by a factor derived from a Gaussian kernel around the highest local importance probability, so the sensitivity toward the most highlighted local pixel becomes adjustable by the trainable parameter α. Theoretically, when α = 0, N(V) = V + ε, and as α → ∞, the coefficients of the pixels other than max{V} approach 0, and LAP conveys the features of the most highlighted local pixel directly. We have added the small value ε to prevent the weights from becoming zero. This value helps preserve the gradient flow to all pixels and prevents zero division in the weighted averaging process.
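A minimal sketch of this scoring module and normalization follows, under stated assumptions: each S_C is a small stack of 1 × 1 convolutions (as used in our experiments), the aggregation A is a fixed maximum over concepts, and the exact form of the Gaussian factor is one plausible reading of the description above rather than the authors' exact formula; all names are illustrative.

import torch
import torch.nn as nn

class ConceptScorer(nn.Module):
    def __init__(self, in_channels, hidden=8, n_concepts=1):
        super().__init__()
        # one small 1x1-convolution head per concept (S_C in Eq. (3))
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_channels, hidden, 1), nn.ReLU(),
                          nn.Conv2d(hidden, 1, 1))
            for _ in range(n_concepts)])

    def forward(self, x):
        # concept-wise importance probabilities I_Concept: (B, n_concepts, H, W)
        probs = torch.sigmoid(torch.cat([h(x) for h in self.heads], dim=1))
        # aggregation A: here a fixed maximum over concepts
        return probs, probs.max(dim=1, keepdim=True).values

def gaussian_local_norm(v, alpha, eps=1e-6):
    # v: (B, 1, k*k, L) importance probabilities under each kernel position.
    # Multiply by a Gaussian factor centred on the local maximum; alpha = 0
    # recovers v + eps, and a large alpha keeps only the most highlighted pixel.
    v_max = v.max(dim=2, keepdim=True).values
    return (v + eps) * torch.exp(-alpha * (v_max - v) ** 2)

Combined with the lap_pool sketch above, the aggregated score map would be passed as scores and gaussian_local_norm (with α as a trainable nn.Parameter) as the local normalization.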
CNNs are generally a stack of layers, most of which work by sliding a kernel all over the input and calculating a function. Although the shallower layers have lower receptive fields, they extract fine common details like edges and corners. As the depth increases, the receptive field increases, and the layers become responsible for extracting higher-level concepts [25]. The final decision is made based on the information flow through the layers, and somewhere in the middle, the network should have understood the distinguishing concepts. LAP helps specify those concepts besides keeping them bold in the pooling processes. As the importance is identified based on the internal forward flow of the network, LAP modules make the model self-interpretable. In contrast to external explainer methods, LAPs do not produce false positives or negatives due to ignored flows in order to reach a human-understandable interpretation. They also find the important parts in the same direction that the model decides in the forward process.

Human experts make their decisions based on special features of the input. For example, jaggedness is one of the factors considered in classifying a tumor as benign or malignant. Neural networks trained freely may or may not have considered all of these reasons in their decision-making. They may have become biased toward a dominant feature in the training dataset and lose generalization. LAP provides an easy way of injecting experts' knowledge into the network due to the probabilistic behavior of its scoring module. Experts can highlight the input parts that are important for their decision-making for each concept. The highlighted map can be resized to each LAP's input and used as ground truth to train each concept scoring head. Knowledge injection gives a better guarantee that the network decides based on the factors of the domain. In most situations, we do not have detailed experts' supervision, but we have general knowledge about the problem. So the LAP modules can be trained in a weakly supervised manner to behave as desired. This kind of training also benefits from the injected gradients and faster training. Here, we have used a linear combination of the weakly supervised losses and the main task's loss to train the models, as described below.

1) Concept-discrimination loss: For this loss, we used the design presented in Fig. 2b for the scoring module, with h concept heads. We assume each sample s has a set of concepts C_s. Each concept head highlights the important pixels for the concept, called concept-related clues. To train the concept heads, we considered the following loss terms:

Min Active Ratio (MinAR). Each concept head should assign a high importance probability to at least a specified portion of the pixels in the samples containing the concept. This makes the model highlight the concept-related clues for the sample.

Max Active Ratio (MaxAR). Each concept head should assign a high importance probability to at most a specified portion of the pixels in the samples containing the concept, as the main clue is not understandable otherwise. This phenomenon may happen in layers with large receptive fields: as the concept-related clue appears in the receptive fields of all of the pixels, the network may consider all of them as concept-related clues.

Inactive Ratio (IAR). Each concept head should assign a low importance probability to all the pixels of the samples not containing the concept, as the concept-related clue should not exist in them. This loss could be applied to all the pixels, but because most of the pixels are generally already inactive in the attention map, the loss term is likely to fade. Therefore, it is better to apply this term to the top-scored pixels according to IAR.

Consider a LAP module l with an input size H × W. The loss function for this LAP is shown in Eq. (5). In this equation, N_c and N_ĉ are the numbers of samples containing and not containing concept c, respectively. k_1 = MinAR × HW and k_2 = (1 − MaxAR) × HW are the numbers of pixels encouraged to have respectively high and low probabilities in the samples containing concept c, and k_3 = IAR × HW is the number of top-scored pixels encouraged to be inactive in the other samples. top^{l,c}_{s,k_1} and bot^{l,c}_{s,k_2} are the sets of the k_1 highest-ranked and k_2 lowest-ranked pixels of sample s based on the importance probability of concept head c. p^{l,c}_{s,i,j} is the probability of concept head c for the pixel (i, j) of sample s. MinAR, MaxAR, and IAR are hyper-parameters. The first term is multiplied by 2 to balance the effect of the positive and negative losses on the concept head c.

We observed in our experiments that choosing the top^{l,c}_{s,k} and bot^{l,c}_{s,k} sets based on the probabilities assigned by the concept head is sensitive to the initial model parameters. If a high weight is assigned to the concept-discrimination loss, the model is likely to get stuck considering a wrong zone in layers with high receptive fields. To prevent this, we used another module with the same architecture as the scoring module to choose the top^{l,c}_{s,k} and bot^{l,c}_{s,k} sets from. This module is trained using the first and third terms of the loss function presented in Eq. (5), without multiplying the first term by 2, and with k_1 = k_3 = HW. We call this module the discriminative scoring module. We detached the input of this module to prevent misleading loss injection into the model, as many input parts may be common between the concepts.
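A minimal sketch of the concept-discrimination loss for one LAP layer and one concept head is given below. It assumes a binary cross-entropy-style penalty on the selected pixels (the exact per-pixel loss in Eq. (5) is not reproduced here) and, for brevity, selects the top- and bottom-ranked pixels from the head's own probabilities rather than from the auxiliary discriminative scoring module; all names are illustrative.

import torch

def concept_discrimination_loss(probs, has_concept,
                                min_ar=0.1, max_ar=0.5, iar=0.1, eps=1e-6):
    # probs:       (B, H, W) importance probabilities of one concept head
    # has_concept: (B,) boolean, True for samples containing the concept
    B, H, W = probs.shape
    flat, hw = probs.view(B, -1), H * W
    k1 = max(1, int(min_ar * hw))        # MinAR: pixels pushed towards 1
    k2 = max(1, int((1 - max_ar) * hw))  # MaxAR: pixels pushed towards 0
    k3 = max(1, int(iar * hw))           # IAR: top pixels of negative samples
    pos, neg = flat[has_concept], flat[~has_concept]
    loss = probs.new_zeros(())
    if pos.numel() > 0:
        top = pos.topk(k1, dim=1).values                   # highest-ranked pixels
        bot = pos.topk(k2, dim=1, largest=False).values    # lowest-ranked pixels
        # the first term is weighted by 2 to balance positive and negative pressure
        loss = loss + 2 * (-torch.log(top + eps)).mean() \
                    + (-torch.log(1 - bot + eps)).mean()
    if neg.numel() > 0:
        top_neg = neg.topk(k3, dim=1).values
        loss = loss + (-torch.log(1 - top_neg + eps)).mean()
    return loss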
2) Knowledge sharing by concordance loss: Intuitively, if one LAP layer has found a part of the input to be the distinguishing clue for one concept, the succeeding LAP layers should also distinguish the same clue. This does not always hold for the previous layers, as they might not have enough understanding to perceive the clue, mainly due to their low receptive field. A human expert can decide whether the receptive field of the LAP layers is sufficient. In that case, we have used the Jensen-Shannon divergence loss to encourage consecutive LAP layers to produce similar maps, where JS(l, s, c) denotes the loss for the concept head c of sample s between LAPs l and l + 1. The loss is calculated only on the pixels whose importance probabilities in the two consecutive LAP layers are more than a specified threshold t. M_s is the number of such pixels in sample s. Using the Jensen-Shannon loss causes the LAP layers to help each other in finding more clues, and the found clues become more robust. If the receptive field does not suffice in some layer, e.g., l, only the pixels with high importance probabilities in l and low importance probabilities in l + 1 can be used in the Jensen-Shannon divergence loss. We have applied this loss between each pair of consecutive LAP layers.

LAP is easily pluggable into any convolutional architecture. Pooling and adaptive pooling layers can be replaced directly with LAPs. Strided convolutions can also be replaced by a convolution with a stride of one, followed by a LAP with the same kernel size and stride as the convolution's stride. The unique advantage of LAP is that it can be plugged into an already trained model and tuned while the other model layers are frozen.

In the experiments, we aimed to show the general applicability of LAP to different domains and architectures without performance loss and to compare its interpretations with state-of-the-art methods for interpreting a model with no or slight modification. To this end, we used three datasets from three different domains. To show the general applicability, we adopted two widely used CNN architectures in our experiments, ResNet [15] and Inception-V3 [37]. While both architectures have high performance, they have different core ideas. ResNet is famous for its residual connections that provide uninterrupted gradient paths to prevent gradient fading. Inception-V3 is known for its multi-resolution analysis, applying kernels of different sizes to the feature map at each network level. We compared our interpretations with five white-box explainer methods: Guided Backpropagation (GBP), Grad-CAM (GC), Guided Grad-CAM (GGC), Deep Lift (DL), and Relevance-CAM (RC). We did our experiments in PyTorch [26] and used the Captum [18] implementations of the interpretation methods. The experiment setup and results for the datasets are presented in the following subsections.

A. RSNA pneumonia detection

The RSNA pneumonia detection dataset [1] was published in a Kaggle challenge in 2018. The dataset contains chest X-ray images of 8851 healthy people, 9555 patients having lung pneumonia, and 11821 patients with other lung abnormalities. The zones related to pneumonia have been specified by experts using bounding boxes. We used the samples related to healthy people and lung pneumonia patients in a classification task. We used 81% of the data for training, 9% for validation, and 10% as the test set. ResNet 18 and Inception V3 were used as the base architectures in this task. We placed three LAP modules in blocks 2, 3, and 4 of ResNet 18, and in maxpool2, Mixed6a, and Mixed7a of Inception V3. Adaptive pooling was also replaced with adaptive LAP in both networks.
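As an illustration of this pluggability, the following is a minimal sketch of how the stride-2 downsampling of a torchvision ResNet-18 block and its global pooling might be swapped for LAPs. The LAP and AdaptiveLAP classes are hypothetical modules wrapping the pooling sketched in the previous section, and the exact placement details are illustrative, not the authors' code.

import torch.nn as nn
from torchvision.models import resnet18
from lap import LAP, AdaptiveLAP   # hypothetical modules wrapping lap_pool above

model = resnet18(weights="IMAGENET1K_V1")   # an already trained model

def replace_stride_with_lap(block):
    # turn the block's stride-2 convolutions into stride-1 convolutions
    # followed by a 2x2 LAP with stride 2, keeping the pretrained weights
    conv = block.conv1
    conv.stride = (1, 1)
    block.conv1 = nn.Sequential(conv, LAP(conv.out_channels, kernel_size=2, stride=2))
    if block.downsample is not None:         # the 1x1 shortcut also strides
        ds_conv, ds_bn = block.downsample[0], block.downsample[1]
        ds_conv.stride = (1, 1)
        block.downsample = nn.Sequential(
            ds_conv, LAP(ds_conv.out_channels, kernel_size=2, stride=2), ds_bn)

for layer in (model.layer2, model.layer3, model.layer4):
    replace_stride_with_lap(layer[0])
model.avgpool = AdaptiveLAP(512, output_size=1)   # adaptive pooling -> adaptive LAP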
For the scoring module, we used two 1 × 1 convolution layers with eight hidden channels and one concept head for detecting pneumonia, obligated to be partly active in positive samples and completely inactive in negative ones. As there is only one concept head, there is no need for an aggregation module, and the final attention scores are equal to the concept scores.

We trained the LAP-extended models with two different methods: weak knowledge injection (WS) and experts' knowledge injection (BB). For weak supervision, we used a cross-entropy loss on the classification head, the concept-discrimination loss, and the inter-LAP concordance loss with weights of 1, 0.25 per LAP, and 0.25 per LAP pair, respectively. The hyper-parameters of the concept-discrimination loss were as follows: MinAR = 0.1, MaxAR = 0.5, IAR = 0.1. The first two were set based on the possible range of concept sizes, i.e., infection, in the positive samples; we chose the median and maximum of the infection bounding-box areas as the mentioned bounds. The third was set to 0.1 to avoid fading its corresponding loss term. For full supervision, we used a cross-entropy loss on the classification head alongside a cross-entropy loss on the LAPs, considering the experts' bounding boxes as the ground truth, with weights of 1 and 0.25 per LAP, respectively. The details of the loss function are explained further in Appendix A.

We trained the models for 300 epochs, where we observed convergence, as there was no change in the last 50 epochs. We selected the model of the epoch with the best performance on the validation data as the final model. We used batches of size 64 (32 healthy and 32 pneumonia samples) and the ADAM optimizer with an initial learning rate of 10^-4 and a decay coefficient of 10^-6 in training. The models were trained on a GeForce RTX 2080 Ti GPU.

We evaluated the models using four metrics: accuracy, sensitivity, specificity, and Balanced Accuracy (BA). Sensitivity and specificity are the recalls of the positive and negative classes, respectively. BA is the average of the recalls, which gives a fair metric for imbalanced datasets. The evaluation metrics on the test data are presented in Tab. I. In both architectures, the LAP-extended versions have surpassed the performance of the vanilla models.

Each LAP layer can also be used as a standalone predictor. If a LAP has assigned a probability of more than 0.5 to at least one pixel, it means it has found infection in the sample. The prediction of the LAP for these samples is assumed positive and otherwise negative. We evaluated the predictivity of the LAP modules as standalone deciders and the accuracy of their faithfulness to the model's decisions. According to Tab. I, as expected, the deeper the layer and the larger its receptive field, the higher the predictivity and the faithfulness. It is also observable that the WS-LAP Inception V3 model is more accurate than the BB-LAP one. In the case of LAP predictivity, the BB-LAPs in both ResNet 18 and Inception V3 generally have higher accuracies than the WS-LAPs, especially in LAP 2. This observation implies that using exact supervision leads to better predictivity in most LAP layers.
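A minimal sketch of this decision rule and of how predictivity and faithfulness might be computed for one LAP layer is shown below; the tensor names and shapes are assumptions, not the authors' code.

import torch

def lap_standalone_metrics(lap_probs, labels, model_preds):
    # lap_probs:   (B, H, W) importance probabilities of the infection head
    # labels:      (B,) ground-truth labels (1 = pneumonia)
    # model_preds: (B,) the full model's predicted labels
    lap_preds = (lap_probs.flatten(1).max(dim=1).values > 0.5).long()
    predictivity = (lap_preds == labels).float().mean()       # vs. ground truth
    faithfulness = (lap_preds == model_preds).float().mean()  # vs. the model
    return predictivity, faithfulness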
To verify the superior performance of our self-interpretation method, we compared the interpretations with five well-known white-box explainers: GC, GGC, GBP, DL, and RC. We used the experts' bounding boxes as ground truth to evaluate the interpretations of the methods as a localization task. One of the strengths of our method is its global interpretability, meaning the importance scores have the same scale in all samples, and any score greater than 0.5 is interpreted as important. The other interpretation methods provide relative importance scores and do not provide a cross-sample threshold to distinguish the important pixels from the unimportant ones. To find a global threshold for the other methods, we normalized each importance map by its maximum value. Then we trained a binary linear classifier based on the normalized scores to discriminate the pixels under the bounding boxes from the others in the validation set (more details are available in Appendix B). We used the classifier's threshold to binarize the normalized score maps of the test data. We used the intersection over union (IoU) between the binarized interpretation maps and the ground-truth bounding boxes to compare the interpretation methods (Binarization by Thresholding). As observed in Tab. II, our method has a significantly higher performance than the other methods in both the WS-LAP and BB-LAP models, except in BB-LAP Inception V3, in which LAP is slightly outperformed by RC. We also adopted another technique for binarizing the infection maps (Binarization by Top-Scored Selection): as ground-truth bounding boxes are available, we selected as many top-scored pixels as the area of the bounding boxes of each sample. Again, LAP has achieved significantly higher performance than the other methods in all models, except in BB-LAP Inception V3, in which LAP is slightly outperformed by RC. We visualized the interpretations of the different methods over four examples for BB-LAP Inception V3 in Fig. 3.

To explore the general applicability of our method, we investigated its effectiveness in a different domain, using the Large-scale CelebFaces Attributes (CelebA) dataset [21]. CelebA is a face-attribute detection dataset containing 40 face attributes of celebrity images. We applied binary classification on the smile attribute of this dataset. The dataset consists of 202,599 samples. We randomly selected 70%, 10%, and 20% of the data for the training, validation, and test sets, respectively. In this task, ResNet 18 and Inception V3 were used as the base architectures, and the LAPs were placed as described in Sec. IV-A. We used two 1 × 1 convolution layers with eight hidden channels and three heads to generate concept-wise scores. The first two concept heads refer to the negative and positive classes, each obligated to be partially active in its respective class samples and completely inactive otherwise. The third head was added to highlight the common effective pixels with low concept-discriminability due to the lack of receptive field. The sum of the three heads' scores was used as the aggregation module to generate the final attention scores.

We trained the LAP-extended models with our proposed weak knowledge injection method. We used the same loss terms and the same procedure for choosing the hyper-parameters of the concept-discrimination loss as in Sec. IV-A, applied to the CelebA dataset. The hyper-parameters were set as follows: MinAR = 0.02, MaxAR = 0.2, and IAR = 0.01 for the first two heads, and MaxAR = 0.1, without MinAR and IAR, for the common concept head. We trained the networks for 12 epochs, where we observed convergence.
The other training hyper-parameters were also the same as in Sec. IV-A. The evaluation metrics on the test set are presented in Tab. III. In both architectures, the LAP-extended versions have surpassed the performance of the vanilla ones. We also evaluated the LAPs' performance and faithfulness as standalone predictors. In this experiment, we took the class with the higher sum of importance scores as the final prediction of the LAP module. We can observe that the LAP layers have a very high faithfulness to the model. Moreover, we can use LAPs to diagnose the model's performance throughout its depth. For example, LAP 2 in LAP Inception V3 has comparable accuracy to the final classifier and higher accuracy than the vanilla version, so we can drop the subsequent layers to obtain a much smaller and faster model in the inference stage without performance loss. The LAP interpretations are visualized for four examples in Fig. 4. As expected, the interpretations are concentrated around the lips. The LAPs have also been successful in discriminating the concepts.

In this experiment, we explored the adaptability of our LAPs to already trained models. We chose the ImageNet classification task [30] to assess whether the LAPs can handle interpretations of objects with high variance in size. We used the pre-trained ResNet 50 from the torchvision model zoo [23] as the base architecture. Generally, the objects of the ImageNet dataset may have a larger size than the receptive field of the first three layers. Therefore, we only used a LAP in the fourth layer. We used a 1 × 1 convolution layer with 1000 heads, each obligated to be partly active in its respective class samples and completely inactive otherwise. We used a 1 × 1 convolution layer as the aggregation module to generate the final attention scores. We used our proposed weakly supervised loss (MinAR and IAR of 0.01, without MaxAR) with a factor of 0.125 besides the main cross-entropy loss. We first trained only the LAP module while the other parameters were frozen for two epochs using the ADAM optimizer with an initial learning rate of 10^-4 and a decay coefficient of 10^-6. Then, we fine-tuned the fourth layer (containing the LAP) and the fully connected layer of the ResNet 50 using a stochastic gradient descent optimizer with an initial learning rate of 10^-3 and a decay coefficient of 10^-6 for three epochs. The performances of the original model, the version with its LAP trained, and the final tuned one are presented in Tab. IV. It is observable that the LAP has adapted itself to the model while the performance is slightly improved.

In contrast to RSNA and CelebA, which had few simple concepts, ImageNet has 1000 concepts with objects of high-variance sizes. Therefore, it is not straightforward to evaluate the LAP's predictivity and faithfulness. We defined the concept size features, F_C ∈ R^1000, as the sum of the pixels' importances in each concept map. Then we trained a 2-layer MLP network to classify the samples according to F_C. To evaluate the predictivity and faithfulness of the LAP, we compared the predictions of the mentioned MLP network with the ground truth and with the LAP-ResNet 50 (FT) predictions, respectively. The results are presented in Tab. IV. Despite the complexity of the prediction task, the high similarity between many of the classes, and the highly summarized information extracted from the concept maps, the LAP has achieved high predictivity and faithfulness.
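A minimal sketch of the concept-size features F_C and the 2-layer MLP probe used for this evaluation is given below; the hidden width and training details are illustrative assumptions.

import torch
import torch.nn as nn

def concept_size_features(concept_probs):
    # concept_probs: (B, 1000, H, W) per-class importance probabilities of the LAP
    return concept_probs.flatten(2).sum(dim=2)       # F_C: (B, 1000)

probe = nn.Sequential(nn.Linear(1000, 512), nn.ReLU(), nn.Linear(512, 1000))
# train `probe` on F_C with cross-entropy against either the ground-truth labels
# (predictivity) or the fine-tuned LAP-ResNet 50 predictions (faithfulness)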
The interpretation of the class heads according to the top-3 predictions of the model is illustrated in Fig. 5. As we can observe, the LAP shows that the model concentrates on the sample's different objects, and it is visually faithful to the model's prediction. It can also illustrate the model's bias in predicting a class. In the first example, the flower has been considered a clue for the bee class. In the second example, the model has assigned a high probability to the taxi class because the sample contains some cars and a yellow object, both highlighted by the taxi concept head. In the third example, the model has considered the curvatures of the flamingos' necks as hooks. The LAP also explains why the model has assigned a high probability to the top-3 classes. In the first example, the bee's head is similar to a fly. In the fourth image, the pattern and color of the snail shell are similar to an acorn.

This paper introduced Local Attention Pooling (LAP), a concept-wise attention-based pooling method pluggable into any convolutional architecture, even an already trained one. We showed that LAPs equip models with self-interpretability without performance loss. Furthermore, LAP attention maps proved their ability to explain the model's behavior faithfully to its predictions compared to other explainers. They can also be used as standalone predictors. Their architecture can adapt to different domains and discriminate their concepts through weak or full supervision and knowledge sharing. In the future, we plan to use LAPs on tasks other than classification and to improve the explainability by addressing the issues with receptive-field dependency.

APPENDIX A

The loss function for full supervision is very similar to the weakly supervised concept-discrimination loss of Sec. III-C. We used the experts' annotated bounding boxes as the ground truth for the active pixels in the first term. Because not all the areas under a bounding box may belong to infection zones, we applied the loss only to the half of the pixels with higher importance probability within the box. All the zones outside the bounding boxes correspond to non-infection areas, and we applied the second term to all of them. The third term was used similarly to the weakly supervised loss.

APPENDIX B

To find a global threshold for the other explainer methods, we first normalized each importance map by its maximum value, according to their papers. Then we created a dataset from the normalized pixel-wise interpretation scores over the validation samples. We assigned the positive label to all the pixels under the experts' annotated boxes and the negative label to the others. We used the RidgeClassifier of sklearn [28] for the classification of the pixels. Due to the large number of pixels, we used the lsqr solver with a tolerance of 10^-3 and an alpha of 0.01, and set the maximum number of iterations to 100. We also used balanced class weighting to address the issue of the highly imbalanced dataset. We used the point at which the classifier's decision function equals zero as the threshold for binarization. We applied this method for each trained model separately and used the resulting threshold to evaluate the model's interpretations.
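A minimal sketch of this thresholding procedure follows: fit a balanced ridge classifier on the normalized validation pixel scores and take the score at which its decision function crosses zero as the global threshold. The function name and array shapes are illustrative.

import numpy as np
from sklearn.linear_model import RidgeClassifier

def global_threshold(scores, labels):
    # scores: (N,) normalized interpretation scores of validation pixels in [0, 1]
    # labels: (N,) 1 if the pixel lies inside an expert bounding box, else 0
    clf = RidgeClassifier(alpha=0.01, solver="lsqr", tol=1e-3,
                          max_iter=100, class_weight="balanced")
    clf.fit(scores.reshape(-1, 1), labels)
    # the decision function is w * s + b; it changes sign at s = -b / w
    return -clf.intercept_[0] / clf.coef_[0, 0]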
APPENDIX C

LAPs can be considered a sequence of information through the depth of the network. Shallower LAPs cannot capture enough information due to their low receptive field; therefore, they are likely to make more mistakes. Some pixels may have been assumed to be important but were found unimportant in the deeper layers, and vice versa. But because of the low receptive field and high resolution, the shallower LAPs produce more detailed maps. We devised an algorithm to integrate the interpretations iteratively from the final LAP layers to the initial layers. In this way, we can have both accuracy and resolution. The procedure's pseudo-code is presented in Algorithm 1, in which α is the decay factor that adjusts the impact of the shallower LAPs. Iteratively, the algorithm modifies the current integrated map, R_{l+1}, with the current LAP attention map, P_l. Considering one pixel of R_{l+1}, r_{l+1} ∈ R, and the set of its corresponding pixels in P_l, p_l ∈ R^{H_K × W_K}, if r_{l+1} is active, i.e., greater than 0.5, at least one pixel in the corresponding zone must have been responsible. If any pixel of p_l is active, the credit belongs only to the active pixels. Otherwise, the current LAP has not comprehended the importance of this zone, and therefore the credit belongs to all of them. Using this scheme, we prune the produced importance map from the last LAP, expected to be the most accurate, to the first.

Due to the limitation on the number of pages in the main paper, we have provided more images interpreted with LAP for RSNA and ImageNet in Figs. 6 and 7, respectively.

Algorithm 1 Pseudo-code for integrating the currently integrated map pixel at position (i, j), from the L-th LAP down to the (l+1)-th LAP, with the l-th LAP attention map in the corresponding kernel. This procedure is repeated for each pixel of each LAP layer, from L to 1.
1: α ← 0.8 ▷ the impact decay factor
2: L ← the number of LAP layers
3: procedure INTEGRATEPIXEL(R, P, l, i, j, H_K, W_K)
4:   p_l ← GETKERNEL(P_l, i, j, H_K, W_K) ▷ the l-th LAP attention map for (i, j)'s corresponding kernel of size H_K × W_K
5:   if l = L then …
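A hedged Python sketch of this integration, reconstructed from the description above, is given below. The way the decay factor α blends the deeper decision with the shallower map is an assumption, and the names and window parameters are illustrative.

import torch

def integrate_maps(lap_maps, kernel_size=2, stride=2, alpha=0.8):
    # lap_maps: list of (H_l, W_l) importance-probability maps, shallow to deep
    R = lap_maps[-1].clone()
    for P in reversed(lap_maps[:-1]):
        R_new = torch.zeros_like(P)
        H, W = R.shape
        for i in range(H):
            for j in range(W):
                r = R[i, j]
                if r <= 0.5:                     # inactive deeper pixels are pruned
                    continue
                ys, xs = i * stride, j * stride
                win = P[ys:ys + kernel_size, xs:xs + kernel_size]
                active = win > 0.5
                # credit the active pixels, or all pixels if none is active
                credit = active.float() if active.any() else torch.ones_like(win)
                R_new[ys:ys + kernel_size, xs:xs + kernel_size] = torch.maximum(
                    R_new[ys:ys + kernel_size, xs:xs + kernel_size],
                    credit * (alpha * r + (1 - alpha) * win))
        R = R_new
    return R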
REFERENCES

[2] Quantifying attention flow in transformers
[3] Neural machine translation by jointly learning to align and translate
[4] Layer-wise relevance propagation for neural networks with local renormalization layers
[5] Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission
[6] Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers
[7] Transformer interpretability beyond attention visualization
[8] InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets
[9] Techniques for interpretable machine learning
[10] Comprehensible classification models: a position paper
[11] LIP: Local importance-based pooling
[12] Explaining explanations: An overview of interpretability of machine learning
[13] European Union regulations on algorithmic decision-making and a "right to explanation"
[14] CA-Net: Comprehensive attention convolutional neural networks for explainable medical image segmentation
[15] Deep residual learning for image recognition
[16] Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)
[17] Auto-encoding variational Bayes
[18] Captum: A unified and generic model interpretability library for PyTorch
[19] An image is worth 16x16 words: Transformers for image recognition at scale
[20] Relevance-CAM: Your model already knows where to look
[21] Deep learning face attributes in the wild
[22] Hierarchical question-image co-attention for visual question answering
[23] Torchvision: the machine-vision package of Torch
[24] Explainable vision transformer based COVID-19 screening using radiography
[25] Feature visualization
[26] PyTorch: An imperative style, high-performance deep learning library
[27] Capsule networks - a survey
[28] Scikit-learn: Machine learning in Python
[29] "Why should I trust you?" Explaining the predictions of any classifier
[30] ImageNet Large Scale Visual Recognition Challenge
[31] Dynamic routing between capsules
[32] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[33] Learning important features through propagating activation differences
[34] Deep inside convolutional networks: Visualising image classification models and saliency maps
[35] When explanations lie: Why many modified BP attributions fail
[36] Striving for simplicity: The all convolutional net
[37] Rethinking the Inception architecture for computer vision
[38] GENESIM: genetic extraction of a single, interpretable model
[39] Score-CAM: Score-weighted visual explanations for convolutional neural networks
[40] Towards interpretable R-CNN by unfolding latent structures
[41] TransPose: Keypoint localization via Transformer
[42] Image captioning with semantic attention
[43] Interpretable convolutional neural networks
[44] An explainable 3D residual self-attention deep neural network for joint atrophy localization and Alzheimer's disease diagnosis using structural MRI
[45] Learning deep features for discriminative localization