Explainability Guided Multi-Site COVID-19 CT Classification
Ameen Ali, Tal Shaharabany, Lior Wolf
2021-03-25

Radiologist examination of chest CT is an effective way to screen for COVID-19 cases. In this work, we overcome three challenges in the automation of this process: (i) the limited number of supervised positive cases, (ii) the lack of region-based supervision, and (iii) the variability across acquisition sites. These challenges are met by incorporating a recent augmentation solution called SnapMix, by a new patch embedding technique, and by performing a test-time stability analysis. The three techniques are complementary and are all based on utilizing the heatmaps produced by the Class Activation Mapping (CAM) explainability method. Compared to the current state of the art, we obtain an increase of five percent in the F1 score on a site with a relatively high number of cases, and a gap twice as large for a site with far fewer training images.

Deep neural networks are currently the leading image classification method. Their ability to generalize is well-documented. However, in many medical imaging domains, one faces challenges that reduce the effectiveness of generic solutions. First, due to the cost of acquisition, privacy issues, and the expertise required for labeling, the typical datasets are smaller than those available for many other computer vision tasks. Second, in medical images, the exact capturing apparatus, its settings, and its operators can all greatly affect the distribution of the obtained images, causing a sizable domain shift. Third, many diseases are manifested through symptoms that are well localized in images, while the supervision is given at the image level.

In this work, we demonstrate that explainability methods, which link the classification outcome to specific image regions, can provide an important building block for overcoming these three issues. First, the heatmap obtained from such methods serves as the basis of an augmentation method called SnapMix [17], which we demonstrate is also effective for the COVID-19 classification task we study in this work. Second, the heatmap can provide a delineation of whether or not local image patches are strongly linked to the obtained classification. By requiring that image patches of similar relevancy have similar embeddings, we can improve the classification performance. Third, we can use the heatmap to validate, at test time, the stability of the obtained classification by perturbing the image locations that are the most relevant to the prediction. If the majority of the perturbations do not support the prediction, we flip the predicted label.

We evaluate our method on well-established benchmarks for the classification of Computed Tomography (CT) scans as COVID-19 positive or COVID-19 negative, and present clear evidence for the utility of our method. The gap in performance we obtain is larger than the variance between the state-of-the-art methods. On site A, in which performance (F1 score and accuracy) is over 90%, we improve to over 95%. On site B, in which the performance levels are in the high seventies, we obtain results of almost 90%.

The SARS-CoV-2 infection (COVID-19) has a devastating impact on the respiratory system and has caused an enormous number of deaths.
In the last year, many deep learning methods were developed for classifying COVID-19 in 2D or 3D medical images [8, 28, 41, 36]. Some recent methods use transfer learning from models pretrained on ImageNet [13, 1]. Following Wang et al. [37], we study classification on two CT datasets. To overcome the domain shift, their approach adds a contrastive loss that decreases the differences between the latent space distributions. Unlike previous work in the domain of CT diagnosis of COVID-19, our method employs a generic ResNet architecture, and our contribution lies solely in the training procedure and in the inference procedure.

Many augmentation approaches have been developed over the years as a form of regularization. These include geometric transformations [35] and color space transformations [38], which have been shown to improve many medical applications [20]. Data mixing approaches create virtual samples that combine images from different categories; the generated image has a fuzzy label derived from the two categories. In MixUp [10], the augmented image is a linear interpolation between two different images, and the fuzzy labels are computed using the same weights as the images. CutMix [40] extracts a box from one image and pastes it onto the second, with fuzzy labels proportional to the area of the box. SnapMix [17] is similar to CutMix, except that the area of the patch is replaced by the sum of the CAM activations within the extracted and the masked patches. It was shown to be highly effective on fine-grained classification datasets of natural images. Here, it is applied to the binary classification of medical images.

The task of generating a heatmap that indicates local relevancy from the perspective of a CNN observing an input image has been tackled from many different directions, including gradient-based methods [32, 33, 31], attribution methods [2, 24, 25, 11], and image manipulation methods [6, 7, 22]. The CAM method [43] is based on the gradient of the loss with respect to the input of each layer. CAM and its extension GradCAM [31] have been used by downstream applications, such as weakly-supervised semantic segmentation [19]. Here, we make a novel use of CAM for creating more effective patch embeddings and for test-time augmentation.

Contrastive learning. The loss we employ between patches of different levels of relevancy is related to contrastive learning methods that have recently made a large impact in the field of self-supervised learning, where it is often used to link an image to its transformed version [14, 23, 3]. Our work is applied at the patch level. Contrastive learning emerged in metric learning [4] and subsequently in unsupervised representation learning [12]. The learned embedding brings associated samples closer, while distancing other samples. In our case, the association is determined by the CAM-derived relevancy.

Our experiments utilize a ResNet-50 network [15], trained with the conventional binary cross-entropy loss L_BCE, as the baseline classifier. We then apply (i) SnapMix [17], (ii) a novel optimization term called the Contrastive Patch Embedding (CPE) loss, and (iii) a novel test-time voting procedure. All three techniques utilize the heatmaps produced by the CAM method [43].
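As a point of reference, a CAM heatmap for such a backbone can be computed along the following lines. This is a minimal sketch, assuming a torchvision ResNet-50 with ImageNet weights; it is not necessarily the exact implementation used in our experiments, and the variable names are illustrative.

```python
# Minimal CAM sketch (assumed torchvision ResNet-50; illustrative only).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V1").eval()
backbone = torch.nn.Sequential(*list(model.children())[:-2])  # up to the 2048 x 7 x 7 features

def cam_heatmap(x, target_class=None):
    """Return a 7 x 7 CAM for a single 1 x 3 x 224 x 224 input tensor."""
    with torch.no_grad():
        feats = backbone(x)                                            # 1 x 2048 x 7 x 7
        logits = model.fc(F.adaptive_avg_pool2d(feats, 1).flatten(1))  # 1 x num_classes
        if target_class is None:
            target_class = logits.argmax(dim=1).item()
        # CAM: class-specific weighted sum of the final convolutional feature maps.
        w = model.fc.weight[target_class]                              # 2048
        cam = torch.relu(torch.einsum("c,bchw->bhw", w, feats))[0]     # 7 x 7
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalize to [0, 1]
```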
The SnapMix method is illustrated in Fig. 1. It combines two training images, depicted in panels (a) and (b), by considering a random box in each image (marked in red). The importance of each of the two boxes is evaluated by integrating the CAM scores in each box (panels c, d). The virtual sample is generated by pasting the box from the second image onto the selected box of the first image (panel e), and labeling the new image proportionally to the integrated CAM scores. More specifically, a ratio (ρ_a, ρ_b) is computed for each image by considering the sum of all CAM scores in a box over the sum of the CAM scores of the entire image. The labels are then linearly interpolated between the labels of the two images, using the complement of the obtained box ratio in the first image (1 − ρ_a) and the ratio in the second image ρ_b. Unlike the original experiments in [17], which considered datasets with many classes, in our case the problem is binary, and it often happens that both images are of the same class. Moreover, since we train using images from two sites, the virtual images created can potentially play a role in overcoming the domain shift.

The input images we receive are of size 224 × 224, the receptive field of the ResNet-50 architecture is of size 32, and the embedding has spatial dimensions of 7 × 7 with a depth of 2,048. For each of the 7 × 7 = 49 vectors in R^2048, we compute the sum of the CAM activations in the associated patch of size 32 × 32. We then select four vectors out of the 49: the two with the highest sum of activations, u_1 and u_2, and the two with the lowest sum, v_1 and v_2. The embedding loss we propose is a contrastive loss [39, 14, 26] that considers the dot products between the four vectors. This loss brings together the two most label-supporting embedding vectors and the two most label-opposing embedding vectors. At the same time, it distances the top label-supporting embedding vectors from the pair of vectors that support the alternative label.

It may be the case that the decision for a certain label is based on local artifacts that bias the network into giving the wrong prediction. In order to avoid such cases, we classify each image k + 1 times: once using the entire image, and k more times using increasingly masked versions of it. For this purpose, we divide the image into small non-overlapping patches of size 8 × 8, obtaining a grid of size 28 × 28. For each cell in the grid, we compute the sum of the CAM activations. We then create k = 31 alternative images by sequentially masking out the patches with the highest sum of activations: in the first alternative image, we mask out the patch with the highest CAM score; in the second, we mask out the two patches with the highest CAM scores; and so on. See Fig. 2 for an illustration. [Fig. 2 caption fragment: removing even a small number of patches (Images 1-3) increased this probability to above 0.5; as more and more patches were removed, the probability of a positive case increased further and became higher than 1 − θ, see the last derived images (out of k = 31), Images 28-31.] The label that we report is obtained by voting among the classifier outputs for the k masked images. A supporting vote occurs when the pseudo-probability obtained from the network classifier is at least θ = 0.2 if the original image has a positive label (i.e., a pseudo-probability larger than 0.5), or lower than 1 − θ for images with a negative label. If more than half of the k votes are not supporting, we flip the label.
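For concreteness, the SnapMix-style mixing described above can be sketched as follows. This is a simplified sketch that follows the ratio and label-interpolation rules given in the text; it does not reproduce the box-sampling and other details of the original implementation [17], and the function and argument names are illustrative. The labels y_a and y_b are assumed to be scalar binary values.

```python
# Simplified SnapMix-style mixing sketch (not the reference implementation of [17]).
import torch
import torch.nn.functional as F

def snapmix_pair(img_a, y_a, cam_a, img_b, y_b, cam_b, box_a, box_b):
    """img_*: 3 x 224 x 224; cam_*: 224 x 224 CAM upsampled to image resolution.
    box_* = (top, left, height, width) are randomly drawn boxes."""
    ta, la, ha, wa = box_a
    tb, lb, hb, wb = box_b
    # Relevance ratio of each box: CAM mass inside the box over the whole image.
    rho_a = cam_a[ta:ta + ha, la:la + wa].sum() / (cam_a.sum() + 1e-8)
    rho_b = cam_b[tb:tb + hb, lb:lb + wb].sum() / (cam_b.sum() + 1e-8)
    # Paste the (resized) box of image b onto the selected box of image a.
    patch = F.interpolate(img_b[None, :, tb:tb + hb, lb:lb + wb],
                          size=(ha, wa), mode="bilinear")[0]
    mixed = img_a.clone()
    mixed[:, ta:ta + ha, la:la + wa] = patch
    # Label interpolation with the complement of rho_a and with rho_b.
    mixed_label = (1 - rho_a) * y_a + rho_b * y_b
    return mixed, mixed_label
```

Similarly, the CPE loss can be sketched as below. Since Eq. 1 itself is not reproduced in this text, an InfoNCE-style objective over the dot products of the four selected vectors, without a temperature, is assumed.

```python
# Sketch of the Contrastive Patch Embedding (CPE) loss (assumed InfoNCE-style form).
import torch
import torch.nn.functional as F

def cpe_loss(feats, cam):
    """feats: 1 x 2048 x 7 x 7 final ResNet features; cam: 7 x 7 CAM scores."""
    vecs = feats.flatten(2).squeeze(0).t()        # 49 patch embeddings of dimension 2048
    scores = cam.flatten()                        # one CAM score per 32 x 32 patch
    top = scores.topk(2).indices                  # u_1, u_2: most label-supporting patches
    bot = scores.topk(2, largest=False).indices   # v_1, v_2: least supporting patches
    u1, u2 = vecs[top[0]], vecs[top[1]]
    v1, v2 = vecs[bot[0]], vecs[bot[1]]

    def nce(anchor, positive, negatives):
        # Cross-entropy over dot-product logits: the positive pair should win.
        logits = torch.stack([anchor @ positive] + [anchor @ n for n in negatives])
        return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

    # Pull u_1 toward u_2 and away from v_1, v_2; symmetrically for v_1.
    return nce(u1, u2, [v1, v2]) + nce(v1, v2, [u1, u2])
```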
To summarize the voting rule: if the inferred labels of more than half of the k alternative images contradict, with high certainty, the label the classifier assigns to the entire image, we flip the predicted label of the image.

Data. We evaluate the proposed method on three datasets; the first two contain CT images of patients who are COVID-19 positive or negative. The SARS-CoV-2 dataset (denoted as site A) consists of 2,482 CT images from 120 patients, of which 1,252 are COVID-19 positive. The 1,230 negative samples are from patients afflicted with other lung diseases. The resolution of these images varies between 119 × 104 and 416 × 512. The COVID-CT dataset [42] (denoted as site B) is much smaller and includes 349 CT images from 216 COVID-19 positive patients and 397 CT images from 171 control patients. The resolution of the images of site B ranges from 102 × 137 to 1853 × 1485. Following [37], the images of both datasets are resized to a fixed resolution of 224 × 224 and are intensity normalized to zero mean and unit variance. The classification accuracy, F1 score, sensitivity, and precision are reported in percent using the train/test splits of the different datasets. The third dataset we employ is COVIDx-CT [9], which is considered one of the largest in terms of the number of annotated samples provided. For the quantitative analysis on this dataset, we report accuracy as well as sensitivity and PPV (positive predictive value) for each infection type at the image level.

Implementation details. The architecture of our model is based on a ResNet-50 followed by an MLP classifier; the ResNet model is initialized with pretrained ImageNet weights. We train the model for 200 epochs. The cross-entropy loss is used, unweighted, on the original samples or on virtual SnapMix samples, as dictated by a Beta distribution with a parameter of α = 1, which is the default parameter in [17]. The L_CPE loss is applied to all samples and is summed, unweighted, with the cross-entropy loss.

Baseline methods. The first two baseline methods used for sites A and B are methods that address domain shift in medical images. Series Adapter [29] and Parallel Adapter [30] include a domain adapter model that is based on a filter bank, in order to learn a joint representation from multiple datasets. MS-Net [21] was originally developed for a multi-site prostate segmentation task. It uses domain-specific auxiliary decoders; for classification tasks, each site is associated with an auxiliary classification head. The results for these three methods are obtained from [37]. The single and joint methods from [36] employ an architecture called COVID-Net; the difference is whether the method is trained on each dataset separately or not. It was also rerun in [37], using a modified architecture (redesign). The SepNorm method of [37] uses features that are normalized for each site separately. It is further augmented with a contrastive loss that minimizes the domain shift ("SepNorm + Contrastive"). We present results for the ResNet-50 based architecture that our method utilizes ("Baseline architecture"), and also study the effect of our CPE loss (Eq. 1) on it ("Baseline + CPE loss"). Results are also presented when augmenting this architecture with the SnapMix method. As additional ablations, we present results for SnapMix combined with either the contrastive loss of [37] ("SnapMix + Contrastive loss") or with our CPE loss ("SnapMix + CPE").
Finally, we present our full method, which includes SnapMix augmentation, the CPE loss, as well as the CAM-driven test-time augmentation and voting. For the COVIDx-CT dataset, we compare our method with the baselines reported in [9]; the COVIDNet-CT baseline [9] was pretrained on ImageNet [5] and later finetuned on the COVIDx-CT dataset [9] using stochastic gradient descent with momentum [27]. We also compare our model with existing image recognition models (ResNet-50, EfficientNet-B0, and NASNet-A-Mobile [16, 44, 34]) finetuned on the COVIDx-CT dataset.

The results are reported in Tab. 1 for site A, and Tab. 2 for site B. Evidently, for both sites, the baseline architecture is already competitive with the best method from the literature, which is SepNorm with the Contrastive loss. On site A the baseline is slightly inferior, and on site B it is considerably preferable. Adding the CPE loss (Sec. 3.2) improves the results on both sites. So does the SnapMix augmentation, by a larger amount. The two contributions are complementary, and adding both the CPE loss and SnapMix provides considerably better results than either separately on site A. On site B, the combination of both provides a slightly higher F1 score than either contribution alone. However, SnapMix by itself is slightly better in terms of the three other scores. The ablation of using the contrastive loss of [37] combined with the SnapMix technique hurts the F1 performance relative to SnapMix on site A, by increasing the precision at the expense of the recall. On site B, it hurts all four scores. Our complete method, which adds the test-time augmentation of Sec. 3.3 on top of SnapMix and the CPE loss, obtains the best accuracy, recall, and F1 score among all methods. Its precision is slightly lower than that of the best ablation method. However, the gap in performance in the F1 score (which combines both precision and recall) is substantial in comparison to the ablation method with the highest precision (almost 5% on site A, and 1.5% on site B). In Tab. 3, we show the results for the COVIDx-CT dataset; our full model achieves superior performance over all of the reported baselines.

Parameter sensitivity. SnapMix employs the default augmentation parameters prescribed by [17]. The CPE loss is defined without the temperature parameter commonly used in other contrastive learning methods, and it employs the minimal number of patches. It is, therefore, virtually parameter-free. The parameter sensitivity of the CAM-driven test-time voting is explored in Fig. 3, in which performance without this voting ("SnapMix + CPE") is depicted as a dashed horizontal line. When varying the number of augmented images k (panel a), we observe that for any value of k there is a performance boost for site B, and this is maximized between k = 30 and k = 35. The performance boost for site A is smaller for all k, and peaks at the value of k = 31. However, no value of k hurts the performance on site A. Varying the value of the probability threshold, depicted in Fig. 3(b), shows that there is a positive benefit for all tested values θ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} for site B. The largest contribution is for the value of 0.4. For site A, however, the contribution is positive only for conservative values (smaller than 0.3, for which flipping the label of the test image becomes less frequent). The value of θ = 0.2 provides a small boost for site A and is also the second highest for site B.
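For concreteness, the CAM-driven masking-and-voting procedure whose parameters k and θ are analyzed above can be sketched as follows. This is a minimal sketch assuming a single-logit classifier with a sigmoid output; the helper structure and names are illustrative, not the exact implementation.

```python
# Sketch of the CAM-driven test-time masking and voting (k masked copies, threshold theta).
import torch
import torch.nn.functional as F

def tta_vote(image, model, cam, k=31, theta=0.2):
    """image: 1 x 3 x 224 x 224; cam: 7 x 7 CAM for the predicted class."""
    prob = torch.sigmoid(model(image))[0, 0].item()    # pseudo-probability of a positive case
    pred_positive = prob > 0.5

    # CAM relevance per 8 x 8 patch: upsample the 7 x 7 map to the 28 x 28 grid.
    patch_scores = F.interpolate(cam[None, None], size=(28, 28), mode="bilinear")[0, 0]
    order = patch_scores.flatten().argsort(descending=True)   # most relevant patches first

    supporting = 0
    for i in range(1, k + 1):
        masked = image.clone()
        for idx in order[:i]:                          # mask out the i most relevant patches
            r, c = divmod(idx.item(), 28)
            masked[:, :, 8 * r:8 * (r + 1), 8 * c:8 * (c + 1)] = 0
        p = torch.sigmoid(model(masked))[0, 0].item()
        # A vote supports the original prediction if the masked image still agrees
        # with it up to the tolerance theta.
        supporting += int(p >= theta if pred_positive else p < 1 - theta)

    # If more than half of the k votes do not support the prediction, flip it.
    return pred_positive if supporting > k // 2 else (not pred_positive)
```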
We present a method for COVID-19 detection in CT scans. The method tackles many of the challenges faced by medical imaging classification systems: distribution shifts across sites, limited training data, and the lack of region-based tagging. We propose to combine three different techniques, which have in common the reliance on the heatmap produced by the CAM explainability method. The first method is a powerful regularizer called SnapMix, which was previously used for fine-grained classification. The second is a novel patch embedding method that considers the two patches that show the strongest CAM activations in a given image and the two that present the lowest activations. Finally, we propose a voting method that constructs multiple masked images based on the CAM scores. Taken together, our method obtains, despite using a generic network architecture, state-of-the-art results on the publicly available COVID-19 CT datasets. The gap in performance is extremely sizable, and we demonstrate the individual contribution of each component to it.

References
[1] Extracting possibly representative COVID-19 biomarkers from X-ray images with deep learning approach and image data related to pulmonary diseases
[2] On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
[3] A simple framework for contrastive learning of visual representations
[4] Learning a similarity metric discriminatively, with application to face verification
[5] ImageNet: A large-scale hierarchical image database
[6] Understanding deep networks via extremal perturbations and smooth masks
[7] Interpretable explanations of black boxes by meaningful perturbation
[8] Rapid AI development cycle for the coronavirus (COVID-19) pandemic: Initial results for automated detection & patient monitoring using deep learning CT image analysis
[9] COVIDNet-CT: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest CT images
[10] MixUp as locally linear out-of-manifold regularization
[11] Visualization of supervised and self-supervised neural networks via attribution guided factorization
[12] Dimensionality reduction by learning an invariant mapping
[13] Finding COVID-19 from chest X-rays using deep learning on a small dataset
[14] Momentum contrast for unsupervised visual representation learning
[15] Deep residual learning for image recognition
[16] Identity mappings in deep residual networks
[17] SnapMix: Semantically proportional mixing for augmenting fine-grained data
[18] Weakly-supervised semantic segmentation network with deep seeded region growing
[19] Tell me where to look: Guided attention inference network
[20] A survey on deep learning in medical image analysis
[21] MS-Net: Multi-site network for improving prostate segmentation with heterogeneous MRI data
[22] A unified approach to interpreting model predictions
[23] Self-supervised learning of pretext-invariant representations
[24] Explaining nonlinear classification decisions with deep Taylor decomposition
[25] Relative attributing propagation: Interpreting the comparative contributions of individual units in deep neural networks
[26] Representation learning with contrastive predictive coding
[27] On the momentum term in gradient descent learning algorithms
[28] A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2
[29] Learning multiple visual domains with residual adapters
[30] Efficient parametrization of multi-domain deep neural networks
[31] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[32] Learning important features through propagating activation differences
[33] Full-gradient representation for neural network visualization
[34] EfficientNet: Rethinking model scaling for convolutional neural networks
[35] Improving deep learning using generic data augmentation
[36] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[37] Contrastive cross-site learning with redesigned net for COVID-19 CT classification
[38] Deep Image: Scaling up image recognition
[39] Unsupervised feature learning via non-parametric instance discrimination
[40] CutMix: Regularization strategy to train strong classifiers with localizable features
[41] Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography
[42] COVID-CT-Dataset: A CT scan dataset about COVID-19
[43] Learning deep features for discriminative localization
[44] Learning transferable architectures for scalable image recognition

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).