Extracting Causal Visual Features for Limited Label Classification
Mohit Prabhushankar, Ghassan AlRegib
March 23, 2021

Neural networks trained to classify images do so by identifying features that allow them to distinguish between classes. These sets of features are either causal or context dependent. Grad-CAM is a popular method of visualizing both sets of features. In this paper, we formalize this feature divide and provide a methodology to extract causal features from Grad-CAM. We do so by defining context features as those features that allow contrast between the predicted class and any contrast class. We then apply a set-theoretic approach to separate causal from contrast features for COVID-19 CT scans. We show that, on average, the image regions containing the proposed causal features require 15% fewer bits when encoded using Huffman coding than those highlighted by Grad-CAM, while yielding an average increase of 3% in classification accuracy over Grad-CAM. Moreover, we validate the transfer-ability of causal features between networks and comment on the non-human-interpretable causal nature of current networks.

In the field of image classification, deep learning networks have surpassed the top-5 human error rate [1] on the ImageNet dataset [2]. In this dataset, neural networks learn to differentiate between 1000 classes of natural images. The success of deep learning networks on natural images has fostered their usage on computed visual data, including the biomedical [3] and seismic [4, 5] fields. While the number of learnable classes in these fields is generally limited, neural networks have the additional task of aiding domain-specific experts by providing interpretable explanations for their decisions, thereby promoting trust in the network. For instance, in the field of biomedical imaging, a medical practitioner diagnoses whether a patient is COVID positive or negative based on CT scans [6]. The authors in [7] use transfer learning approaches on CT scans to perform this detection and provide explanatory results using Grad-CAM [8] to justify their network's efficacy in detecting COVID-19. Grad-CAM highlights the features in the image that lead to the network's decision. In this paper, we analyze a neural network's causal capability by providing a technique to extract causal features from such existing explanatory methods.

Probabilistic causation assumes a causal relationship between two events C and P if event C increases the probability of occurrence of P [9]. In image classification networks, P refers to the decisions made by neural networks based on features C. A popular method for ascertaining probabilistic causality is through interventions in data [10]. In these models, the set of causal features C is varied by intervening in the generation process of C to ascertain the change in the observed decision P. Such interventions can, however, be long, complex, unethical, or impossible [11], as is the case for COVID CT scans. Hence, we forego interventionist causality and rely on observed causality to derive causation. Observed causality relies on passive observation to determine statistical causality. The authors in [12] propose that non-interventionist observation provides two sets of features, causal C and context B, that lead to decision P. In other words, a decision P is made based on both causal and context features in an image.
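For concreteness, the display below restates the probabilistic-causation condition and the observational feature split in shorthand; the symbols $\Pr$ and $\neg C$ are introduced here only for this restatement and are not used elsewhere in the paper.

```latex
% Probabilistic causation [9]: features C cause decision P when their presence
% raises, and their absence lowers, the probability of P:
\Pr(P \mid C) \;>\; \Pr(P) \;>\; \Pr(P \mid \neg C).
% Observed (non-interventionist) causality [12]: the features behind a decision P
% split into a causal set C and a context set B:
\text{features leading to } P \;=\; C \cup B.
```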
Hence, existing explanatory methods, including [8, 13, 14], highlight C ∪ B features. However, they do not provide a methodology to extract either C or B separately. In this paper, we utilize contrastive features from [15] to approximate the B features. We then propose a set-theoretic approach to abstract C out of Grad-CAM's C ∪ B features. The contributions of this paper include:

• Formulating a set-theoretic interpretation of causal and context features for observed causality in visual data.
• Expressing context features as contrastive features.
• Providing an evaluation setup that tests for causality in a limited label scenario.

In Section 2, we motivate context via contrast and review Grad-CAM and contrastive explanations. We then motivate the proposed method and detail its procedure in Section 3. We finally present the results in Section 4 before concluding in Section 5.

Causal and Context features: The authors in [12] define causal features as visual features that exist within the physical body of the object in an image, and context features as visual features that surround the object in the image. In this paper, we forego definitions based on physical location in favor of a feature's membership towards predicting a class P. We define causal features C as those features whose presence increases the likelihood of decision P occurring in any CT scan x. Conversely, the absence of causal features C decreases the probability of decision P. These two definitions of causal features are derived from the Common Cause Principle [16] and are used in [17] to evaluate causality. We follow a slightly altered methodology to showcase the causal effectiveness of our method. We define context features B contrastively, as features that allow differentiating between the predicted class P and a contrast class Q without necessarily causing P.

Context and Contrast features: In the field of human visual saliency, the authors in [18] argue for the existence of contextual features of a class that are represented by their relationship with features of other classes. In [19], the implicit saliency of a neural network is extracted as an expectancy mismatch between the predicted class and all learned classes, thereby empirically validating the existence of contrastive information within neural networks. The authors in [15] extract this information and visualize it as explanations. In this paper, we represent the context features B as contrast features.

Grad-CAM and Contrastive Explanations: Consider a trained binary classifier $f()$. Given an input image $x$, $y = f(x)$ are the logit outputs of dimension $2 \times 1$. The predicted class P of image $x$ is the index of the maximum element in $y$, i.e., $P = \arg\max_i y_i,\ i \in [1, 2]$. Grad-CAM localizes all features in $x$ that lead to a decision P by backpropagating the logit $y_P$ to the last convolutional layer $l$. The per-channel gradients in layer $l$ are summed to obtain an importance score $\alpha_k$ for each channel $k$, $k \in [1, K]$, and multiplied with the activations $A^k$ of their respective channels. The importance-score-weighted activation maps are averaged to obtain the Grad-CAM mask $G_P = \mathrm{ReLU}\big(\sum_{k=1}^{K} \alpha_k A^k\big)$ for class P. The authors in [15] modified the Grad-CAM framework to backpropagate a loss function $L_{P,Q}$ between the predicted class P and a contrast class Q. With the other steps remaining the same, the contrast-importance score $\alpha^c_k$ weighted contrast mask is given by $C = \mathrm{ReLU}\big(\sum_{k=1}^{K} \alpha^c_k A^k\big)$ for predicted and contrast classes P and Q.
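For concreteness, a minimal PyTorch-style sketch of the two maps described above is given below. It assumes a trained two-class model and a handle `last_conv` to its last convolutional layer; the hook-based helpers and the mean-squared-error form chosen for the loss $L$ are illustrative assumptions rather than the exact implementation behind the reported results.

```python
import torch
import torch.nn.functional as F

def _capture(layer):
    """Hook the given layer to store its activations A^k and the gradients
    that flow back into them during a backward pass."""
    store = {}
    h_fwd = layer.register_forward_hook(lambda m, i, o: store.update(acts=o))
    h_bwd = layer.register_full_backward_hook(lambda m, gi, go: store.update(grads=go[0]))
    return store, lambda: (h_fwd.remove(), h_bwd.remove())

def _weighted_map(acts, grads):
    """alpha_k = per-channel sum of gradients; map = ReLU(sum_k alpha_k * A^k)."""
    alphas = grads.sum(dim=(2, 3), keepdim=True)            # [1, K, 1, 1]
    return F.relu((alphas * acts).sum(dim=1)).squeeze(0)    # [H, W]

def grad_cam(model, last_conv, x):
    """Grad-CAM: backpropagate the predicted logit y_P to the last conv layer."""
    store, remove_hooks = _capture(last_conv)
    model.zero_grad()
    y = model(x)                        # logits of shape [1, 2]
    y[0, y.argmax()].backward()
    remove_hooks()
    return _weighted_map(store["acts"], store["grads"])

def contrast_map(model, last_conv, x, target):
    """Contrastive map: backpropagate a loss L(y, target), e.g. target = [1, 0].
    The MSE form of L is an assumption for illustration."""
    store, remove_hooks = _capture(last_conv)
    model.zero_grad()
    y = model(x)
    t = torch.tensor([target], dtype=y.dtype, device=y.device)
    F.mse_loss(y, t).backward()
    remove_hooks()
    return _weighted_map(store["acts"], store["grads"])
```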
Note that gradients are used as features in multiple works, including [20, 21, 22].

We first motivate our method based on set theory before describing the process of extracting causal features. Consider the setting described in Section 2, where a binary classification network $f()$ is trained on COVID-19 CT scans [6]. Once trained, for any given scan from the dataset, Grad-CAM provides visual features that combine both causal and context features. Hence, Grad-CAM provides a mask $G_P = C_P \cup B_P$ for the prediction P on a given scan $x$. If the network classifies $x$ correctly with 100% confidence, then $f()$ has resolved causal and context features independently, such that $G_P = C_P + B_P$ (i.e., $C_P \cap B_P = \emptyset$). However, this rarely occurs in practice, and we instead assume $G_P = C_P + B_P - (C_P \cap B_P)$. Hence, our goal is to extract the relative complement $C_P \setminus B_P$ given $G_P = C_P \cup B_P$. This is illustrated in Fig. 1. Based on a visual inspection of the Venn diagram, we can rewrite $C_P \setminus B_P$ as

$C_P \setminus B_P \;=\; G_P \setminus B_P \;=\; G_P \cap \overline{B_P}. \qquad (1)$

Note that we do not have access to either $C_P$ or $B_P$; we are only provided with $G_P$. In this paper, we estimate the context features $B_P$ using contrastive features from [15]. Specifically, continuing the notation from Section 2, we represent $B_P$ in terms of the contrastive maps described below (Eq. 2). Substituting Eq. 2 back into Eq. 1, we obtain our final formulation (Eq. 3). A Venn diagram visualization is presented in Fig. 1. We qualitatively explain all the contrastive terms.

$C_{P,Q}$: Highlights features that answer 'Why P or Q?'. This term contrastively leads to a decision of either P or Q. In the binary setting, we approximate this to be the set of all possible features $U$. Borrowing notation from Section 2, $C_{P,Q}$ is obtained by backpropagating a loss $L(y, [1, 1])$ to obtain a contrast-importance score $\alpha^{P,Q}_k$.

$C_{\bar{P},\bar{Q}}$: Highlights features that answer 'Why neither P nor Q?'. The features in this term do not increase the probability of either P or Q. $C_{\bar{P},\bar{Q}}$ is obtained by backpropagating a loss $L(y, [0, 0])$ to obtain a contrast-importance score $\alpha^{\bar{P},\bar{Q}}_k$.

$C_{P,\bar{P}}$: Highlights features that answer 'Why not P with 100% confidence?'. Hence, it highlights all unresolved causal features. $C_{P,\bar{P}}$ is obtained by backpropagating a loss $L(y, [1, 0])$ to obtain a contrast-importance score $\alpha^{P,\bar{P}}_k$.

Continuing the notation from Section 2, the implementation equivalent of Eq. 3 is given by Eq. 4, where $\alpha_k$ represents the importance score from Grad-CAM and $\alpha^c_k$ represents the normalized importance score from the contrast maps. The overall negative sign in Eq. 4 occurs because the $\alpha$ are gradients, whose directions are opposite to those of the feature minima. The final map is normalized and visualized. A representative COVID negative scan and its Grad-CAM [8] and Grad-CAM++ [13] explanations are shown in Figs. 2c, 2d, and 2e, respectively. The causal map from Eq. 4 and the contrastive maps $C_{P,Q}$ and $C_{P,\bar{P}}$ are visualized in Figs. 2f, 2g, and 2h, respectively. Note that while $C_{P,Q}$ appears similar to $G$, $C_{P,Q}$ is biased by normalization and its $\alpha_k$ values are smaller.

Effect of number of classes: In a binary classification setting, we need four feature maps, one Grad-CAM map and three contrastive maps, to extract causal features. These are obtained by backpropagating the targets {(0, 0), (1, 0), (1, 1)} and the logit for Grad-CAM. Hence, we backpropagate the power set of all possible class combinations. This translates to $2^N$ backpropagations for N classes. Therefore, this technique is suitable for a limited class scenario.
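Continuing the sketch above, the snippet below illustrates the $2^N$ backpropagations for an $N$-class problem: it enumerates the power set of class targets, replaces the singleton target of the predicted class with plain Grad-CAM, and collects one map per backward pass. It reuses the hypothetical `grad_cam` and `contrast_map` helpers from the previous sketch and stops short of the final combination, which follows Eq. 4.

```python
from itertools import product

def contrast_targets(n_classes):
    """The power set of class combinations as binary target vectors,
    e.g. for N = 2: (0,0), (0,1), (1,0), (1,1); 2^N vectors in total."""
    return list(product([0.0, 1.0], repeat=n_classes))

def collect_feature_maps(model, last_conv, x, n_classes=2):
    """One map per backpropagation: Grad-CAM for the predicted class plus the
    remaining 2^N - 1 contrastive maps. Combining them into the causal map
    (the relative complement of B_P in G_P) follows Eq. 4 and is omitted here."""
    pred = model(x).argmax().item()
    singleton = tuple(1.0 if i == pred else 0.0 for i in range(n_classes))
    maps = {"grad_cam": grad_cam(model, last_conv, x)}   # replaces the singleton target
    for t in contrast_targets(n_classes):
        if t == singleton:
            continue                                     # covered by Grad-CAM above
        maps[t] = contrast_map(model, last_conv, x, list(t))
    return maps                                          # 2^N maps in total
```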
In this section, we detail the experiments that validate the causal nature of our proposed features. We perform two sets of experiments to validate within-network and inter-network causality. The COVID-19 dataset [6] consists of 349 COVID positive CT scans and 463 COVID negative CT scans. We train ResNet-18, -34, and -50 [1] and DenseNet-121 and -169 [23] as described in [7].

The authors in [17] propose two causal metrics: deletion and insertion. In deletion, the identified causes are deleted pixel by pixel, and the probability of the predicted class is monitored as a function of the fraction of removed pixels. In insertion, the non-causal pixels are added, and the increase in probability as a function of the added fraction of pixels is noted. However, in a binary setting, the probability of a class rarely decreases to a large extent even after removing a majority of the pixels. Hence, we modify the deletion and insertion setup to measure accuracy instead of probability on masked images. A threshold is applied to the Grad-CAM [8], Grad-CAM++ [13], and proposed causal maps. For deletion, the pixels greater than the given threshold are set to 1 and the rest to 0, and vice versa for insertion. The binary mask is then multiplied with the original input image, and the masked image is passed through the model. The model's prediction is noted. This is conducted on all images in the testing set, and the average test accuracy is calculated. The experiment is conducted with 81 thresholds ranging from 0.1 to 0.8 with an increment of 0.01. The average accuracy in each case is noted.

From Figs. 2f and 3, we see that the area highlighted by the proposed causal features is smaller than that of the compared methods. To objectively measure this area, we encode the original and masked images using Huffman coding [24] as $H_f$ and $H_m$, respectively. The ratio of the bits is taken as $H = H_m / H_f$. Each average accuracy for a threshold is thus associated with an $H$. All 81 accuracies are plotted as a function of their $H$ and depicted in Fig. 2a. Consider two points $D_C$ and $D_G$ on the proposed causal and Grad-CAM curves, respectively. These points depict roughly 65% averaged accuracy. From the corresponding bit rates on the x-axis, the causal features achieve this accuracy at a lower bit rate compared to Grad-CAM. Hence, dense causal features are encoded with fewer bits in the proposed method. This is validated in the insertion plot in Fig. 2b as well. The per-threshold results are reported in Table 1. It can be seen that the Huffman ratio for the proposed method is lower than that of Grad-CAM for all thresholds. Hence, it is able to identify dense causal features from Grad-CAM. The averaged accuracy of the masked images is also shown for 4 other networks. In 19 of the 20 categories, the proposed causal-feature-masked images outperform the Grad-CAM-feature-masked images with a lower Huffman ratio. In Table 2, we extract masks using ResNet-34, perform deletion based on the shown thresholds, and obtain Huffman ratios for all test images. These masked images are then passed into the corresponding networks, and the accuracy results are shown. In 10 of the 20 categories, the proposed causal features outperform Grad-CAM features.
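A minimal sketch of the modified deletion/insertion measurement and the Huffman ratio $H = H_m / H_f$ is given below. The map returned by `cam_fn` is assumed to be normalized to [0, 1], and `to_uint8` is an illustrative helper; neither the loop structure nor the helper names reflect the exact evaluation code behind Tables 1 and 2.

```python
import heapq
from collections import Counter

import numpy as np
import torch

def huffman_bits(img_uint8):
    """Number of bits needed to Huffman-encode [24] the pixels of an 8-bit image."""
    freqs = Counter(img_uint8.flatten().tolist())
    if len(freqs) == 1:                    # degenerate single-symbol image
        return img_uint8.size
    # Heap entries: (weight, tie-breaker, [(symbol, code length), ...]).
    heap = [(w, i, [(s, 0)]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        w1, _, l1 = heapq.heappop(heap)
        w2, _, l2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, uid, [(s, d + 1) for s, d in l1 + l2]))
        uid += 1
    return sum(freqs[s] * d for s, d in heap[0][2])

def to_uint8(x):
    """Illustrative helper: rescale a tensor scan to an 8-bit numpy image."""
    x = x.detach().cpu().numpy()
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    return (255 * x).astype(np.uint8)

def masked_accuracy(model, loader, cam_fn, thr, mode="deletion"):
    """Threshold the map, mask the scan, and record the average accuracy and
    the average Huffman ratio H = H_m / H_f over the test set."""
    correct, total, ratios = 0, 0, []
    for x, y in loader:                            # x: [1, C, H, W], y: [1]
        cam = cam_fn(x)                            # [H, W], assumed in [0, 1]
        keep = (cam > thr) if mode == "deletion" else (cam <= thr)
        xm = x * keep.to(x.dtype)                  # binary mask times the scan
        with torch.no_grad():
            correct += (model(xm).argmax(dim=1) == y).sum().item()
        total += y.numel()
        ratios.append(huffman_bits(to_uint8(xm)) / huffman_bits(to_uint8(x)))
    return correct / total, sum(ratios) / len(ratios)
```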
The authors in [17] argue that humans must be kept out of the loop when evaluating causality. However, by definition, explanations are rationales used by networks to justify their decisions [25]. These justifications are made for the benefit of humans. Such justifications are required in fields like biomedical imaging, where deep learning tools are used as aids by medical practitioners. We visualize Grad-CAM and the underlying causal features from the proposed technique in Fig. 3. Both original scans are from COVID positive patients. In Fig. 3a, Grad-CAM fails to highlight the circled red region that depicts COVID. More importantly, the extracted causal features are at the bottom right. Feeding the masked images into ResNet-18, the network classifies both correctly but with higher confidence on the causal features. In Fig. 3b, we pick a scan whose Grad-CAM and causal features were classified with the same confidence but from different regions within the scan. These results suggest that it is the context features that add human interpretability and the causal features that aid classification. In real-world biomedical applications such as the considered COVID-19 detection, it is imperative to identify and make decisions based on causal features. It merits further study to design better networks whose causal features are more human interpretable, similar to Grad-CAM's combined causal and context feature set.

In this paper, we formalize the causal and context features that a neural network bases its decisions on. We express context features in terms of contrastive features between classes that the neural network has implicitly learned. This allows the separation of causal from context features. Grad-CAM is used as the explanatory mechanism from which causal features are extracted. We validate and establish the transfer-ability of these causal features across networks. The visualizations suggest that the causal regions that a neural network bases its decisions on are not always human interpretable. This calls for more work on designing human-interpretable causal features, especially in fields like biomedical imaging.

References
[1] Deep residual learning for image recognition.
[2] ImageNet large scale visual recognition challenge.
[3] Relative afferent pupillary defect screening through transfer learning.
[4] A machine-learning benchmark for facies classification.
[5] Towards understanding common features between natural and seismic images.
[6] COVID-CT-Dataset: A CT scan dataset about COVID-19.
[7] Sample-efficient deep learning for COVID-19 diagnosis based on CT scans.
[8] Grad-CAM: Visual explanations from deep networks via gradient-based localization.
[9] Probabilistic causation.
[10] Models, reasoning and inference.
[11] Inferring causal networks from observations and interventions.
[12] Discovering causal signals in images.
[13] Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks.
[14] Striving for simplicity: The all convolutional net.
[15] Contrastive explanations in neural networks.
[16] On Reichenbach's common cause principle and Reichenbach's notion of common cause.
[17] RISE: Randomized input sampling for explanation of black-box models.
[18] The role of context in object recognition.
[19] Implicit saliency in deep neural networks.
[20] Distorted representation space characterization through backpropagated gradients.
[21] Backpropagated gradient representations for anomaly detection.
[22] Gradients as a measure of uncertainty in neural networks.
[23] Densely connected convolutional networks.
[24] A method for the construction of minimum-redundancy codes.