key: cord-0564712-n5mdm7od
authors: Wang, Ella Y.; Som, Anirudh; Shukla, Ankita; Choi, Hongjun; Turaga, Pavan
title: Interpretable COVID-19 Chest X-Ray Classification via Orthogonality Constraint
date: 2021-02-02
journal: nan
DOI: nan
sha: 7e7d09b3a4ecdce8d65b21ca244efc27c3403eba
doc_id: 564712
cord_uid: n5mdm7od

Deep neural networks have increasingly been used as an auxiliary tool in healthcare applications because they improve performance on several diagnostic tasks. However, these methods are not widely adopted in clinical settings because of practical limitations in the reliability, generalizability, and interpretability of deep-learning-based systems. As a result, methods have been developed that impose additional constraints during network training to gain more control and improve interpretability, facilitating their acceptance in the healthcare community. In this work, we investigate the benefit of using the Orthogonal Spheres (OS) constraint for classification of COVID-19 cases from chest X-ray images. The OS constraint can be written as a simple orthonormality term that is used in conjunction with the standard cross-entropy loss during classification network training. Previous studies have demonstrated significant benefits in applying such constraints to deep learning models. Our findings corroborate these observations, indicating that the orthonormality loss function effectively produces improved semantic localization via Grad-CAM visualizations, enhanced classification performance, and reduced model calibration error. Our approach achieves an improvement in accuracy of 1.6% and 4.8% for two- and three-class classification, respectively; similar results are found for models with data augmentation applied.
In addition to these findings, our work presents a new application of the OS regularizer in healthcare, increasing the post-hoc interpretability and performance of deep learning models for COVID-19 classification to facilitate adoption of these methods in clinical settings. We also identify limitations of our strategy that can be explored in future research.

Deep learning techniques have been increasingly used as an adjunct tool in medical science for developing automated solutions for disease diagnosis. For example, they have been used to classify brain disease [13], segment lung and fundus images [24], and detect breast cancer [19]. More recently, with the wide spread of COVID-19, deep networks have also been shown to be useful in developing tools for automated detection of such cases from chest X-ray images [17, 22, 14, 9]. Such tools assist in accurate and rapid diagnosis, reducing the burden on doctors and overcoming the limitations of time-consuming methods such as Reverse Transcription-Polymerase Chain Reaction (RT-PCR).

COVID-19 is often diagnosed with RT-PCR using upper and lower respiratory specimens [21]. However, the low sensitivity of RT-PCR (60-70%), high false negative rates, long processing times, and shortages of testing kits hinder diagnosis and delay the start of treatment [10, 25]. In contrast, radiologic imaging such as computed tomography (CT) and X-ray are promising diagnostics for COVID-19. X-ray evaluations are relatively easy and fast to perform and achieve much higher sensitivity than RT-PCR, making them a more reliable and useful technology for early detection of COVID-19 [2]. CT is widely used in countries such as Turkey where testing kits are largely unavailable.
Researchers have found that consolidation, ground-glass opacities (GGO), crazy-paving patterns, and reticular patterns are common features in CT images of patients with COVID-19; Bernheim et al. [3] observed bilateral and peripheral GGO as key characteristics, and Li and Xia [12] reported GGO and consolidation among their observations. However, such subtle irregularities can only be detected by radiology experts and require valuable time, delaying diagnosis and treatment.

Although deep learning models have achieved significant performance gains in medical tasks, they have not been readily adopted in clinical settings because of their limited reliability, generalizability, and interpretability. A lack of understanding of how such methods work limits the practical application of deep learning in healthcare. Therefore, to facilitate the adoption of deep learning models, it is increasingly important to elucidate how these methods work and to establish trust in them.

Several deep learning approaches to automated detection of COVID-19 from chest X-ray images have recently been developed [17, 22, 14, 9]. However, the post-hoc interpretability of these models is rather limited: when activation maps are visualized with Grad-CAM [18], the regions of interest tend to be delocalized, which yields poor semantic localization and makes it difficult for radiologists to understand model decisions. Furthermore, there is still much room for improvement in the overall performance and accuracy of existing models. In this work, we aim to improve both the performance of chest X-ray classification and its interpretability to aid in identifying COVID-19 cases. In the medical field, meaningful interpretability is especially important so that end users, such as radiologists, can comprehend and explain model predictions.
Interpretable deep learning models would assist healthcare personnel in taking more logical and data-driven actions, improving the quality of healthcare.

In this work, we use the OS parameterization to train a deep neural network for automated detection and classification of COVID-19 in chest X-ray images. Our work is primarily driven by the findings of Shukla et al. [20] and Choi et al. [4], who used OS constraints to improve the generalization of learned representations. Our implementation of OS constraints for chest X-ray image datasets [5, 23] yields improvements in classification performance and better localization and preservation of regions of interest in Grad-CAM heatmap visualizations compared to baseline models. Our OS-constrained model achieved slightly higher accuracy than baseline models [17], and we observe that the OS regularizer resulted in higher activation around lung areas and reduced focus on the background. These findings contribute to greater post-hoc interpretability and performance of deep learning models for detecting COVID-19. Our approach may also give radiologists more insight into classification decisions and lead to greater acceptance of deep learning models in clinical settings.

Several studies have been published on the diagnosis of COVID-19 from X-ray images. Hemdan et al. [9] proposed a COVIDX-Net model made up of seven CNN models to detect COVID-19. Minaee et al. [14] prepared a dataset of 5000 chest X-rays and trained four popular CNNs, reporting that ResNet18 and SqueezeNet obtained the best performance. Wang and Wong [22] proposed the COVID-Net deep learning model to diagnose COVID-19 from X-ray images, which achieved 92.4% accuracy in identifying healthy, non-COVID pneumonia-infected, and COVID-19-infected patients. However, these methods used limited data to develop the models. Most notably, Ozturk et al.
[17] proposed the DarkCovidNet model, which has an end-to-end architecture without the need for manual feature extraction methods. Trained on a more extensive dataset of 1125 chest X-ray images [5, 23], the model achieved superior performance compared to other studies, obtaining 98.08% and 87.02% accuracy for two- and three-class classification, respectively.

In existing approaches, Grad-CAM [18] heat maps are used to visualize the parts of an image that contributed to the model's classification. We observe from the results of previous works that the heat maps generated from current deep learning models are highly varied. Many visualizations point to delocalized regions of interest outside the lungs, including the shoulder bone and lung bone, despite these areas being unaffected by COVID-19. Such varied heat maps are not meaningful for post-hoc interpretation by radiologists and provide unclear insight into which regions of an image contributed to the final prediction.

In this section, we provide a brief overview of the two components that are used in developing our strategy for chest X-ray classification for identifying COVID-19 cases.

Ozturk et al. [17] proposed the DarkCovidNet model, which classifies chest X-ray images into three classes: no-findings, pneumonia, and COVID-19. We use the DarkCovidNet model as our baseline because of its superior performance over existing methods, and modify it to incorporate the OS constraint. Such methods emerged because X-ray imaging is preferred over CT scans for its lower radiation dose. The DarkCovidNet model has been shown to perform well, with sufficient sensitivity, in tasks such as detecting ground-glass opacities (GGO) in patients with COVID-19 [27]. Further, the DarkCovidNet model was trained with a larger dataset than counterpart methods [22, 9, 15] developed for COVID-19 identification from chest X-ray images. Input images are of shape 256x256x3.
The DarkCovidNet model consists of 17 convolution layers and 5 pooling layers. Each DarkNet layer consists of a convolution layer, batch normalization, and a LeakyReLU operation [26]. Batch normalization standardizes inputs, stabilizes the model, and reduces training time. LeakyReLU is a version of the ReLU operation [1] that applies a small slope (epsilon) to negative inputs to prevent dying neurons. In the DarkCovidNet model, max pooling is used in all of the pooling operations. The model ends with Flatten and Dense layers that produce the outputs. The last convolutional layer of the DarkCovidNet model for three classes uses a 3 × 3 × 1 convolutional filter with height 3, width 3, and depth 1. With this setup, the baseline DarkCovidNet model has a total of 1,170,811 parameters. This convolutional layer is modified in our experiments to incorporate the OS constraint, which requires the representation to be split into k feature blocks of equal dimensions.

We make use of the OS parameterization proposed by Shukla et al. [20] in a generative model setting and adapt it for our classification setting. For a given input image, let $Z \in \mathbb{R}^{m}$ represent the output of a specific layer of the CNN model, where $m$ is the feature dimension. We partition this representation into k feature blocks, $Z = [z_1, z_2, \ldots, z_k] \in \mathbb{R}^{d \times k}$, where k is the number of partitions and $d = m/k$ is the dimension of each partition. To make the matrix $Z$ as orthogonal as possible, we regularize the off-diagonal elements of $Z^{\top} Z$ to be zero. Applying this orthogonality condition to the matrix $Z$, we arrive at the simple orthonormality term shown below:

$$L_{OS} = \left\| Z^{\top} Z - I \right\|_F \tag{1}$$

Here, $L_{OS}$ represents the OS regularizer and $I$ represents the $k \times k$ identity matrix, with $\|\cdot\|_F$ being the Frobenius norm. The OS regularizer is applied along with the standard cross-entropy loss function. This OS constraint was recently employed by Choi et al.
[4], who showed that the network learns more diverse representations, reducing model calibration error while effectively improving semantic localization. These improvements were shown on standard computer vision datasets such as CIFAR10 [11], CIFAR100 [11], SVHN [16], and tiny ImageNet [6]. In this work, we explore and harness the capabilities of OS constraints for medical images to improve the interpretability of results, making them more acceptable to medical practitioners.

Deep networks are conventionally trained using the categorical cross-entropy loss function for classification tasks. However, models obtained using this loss function tend to exhibit low interpretability, feature redundancy, and poor calibration. Instead, we approach this problem with orthogonal-sphere (OS) constraints. The OS parameterization discussed in subsection 2.3 is applied to the output of the flatten layer following the last convolutional layer of the DarkCovidNet model. In doing so, we seek to reduce the number of correlated features learnt by deeper layers in the network. Our training pipeline for the proposed implementation of the OS regularizer is depicted in Figure 2. The OS regularization function is used together with the regular categorical cross-entropy loss. Thus, with $L_{OS}$ representing the OS regularizer and $L_{CE}$ the categorical cross-entropy loss, our total loss function can be characterized as

$$L = L_{CE} + \lambda L_{OS} \tag{2}$$

Here, 0 ≤ λ ≤ 1 is a trade-off parameter.

Our experiments are conducted on the same dataset as used by Ozturk et al. [17]. The dataset has three classes: COVID-19, pneumonia, and healthy (no-findings). The images for the COVID-19 class are obtained from an open-source database of COVID-19 chest X-ray images collected by Cohen et al. [5]. This database is continuously updated with images submitted by researchers. Currently, there are 132 X-ray images of COVID-19 diagnosis in the database, out of which 125 are confirmed to be positive. We use these 125 images for the COVID-19 class in our experiments.
For the healthy (no-findings) and pneumonia classes, 500 chest X-ray images per class were drawn randomly from the ChestX-ray8 database collected by Wang et al. [23], making a total of 1125 images in the dataset.

We performed experiments to classify COVID-19 from chest X-ray images in two different scenarios. First, we trained the DarkCovidNet model (Baseline) and the OS-constrained model (Baseline + OS) to classify X-ray images into three classes: COVID-19, Pneumonia, and No-Findings. Second, the performance of these two models was evaluated in a classification task with two classes: COVID-19 and No-Findings. The performance of the models is evaluated using 5-fold cross-validation: the models are evaluated for each fold, and the average classification performance is calculated. We use an 80/20 split for training and testing.

All the experiments are conducted using an NVIDIA Tesla P100 GPU and Python 3.7 with TensorFlow 2.3.0. Our models are trained for 100 epochs using the Adam optimizer, with batch size 32 and initial learning rate 0.003. We used the default Adam momentum parameters: β1 = 0.9 and β2 = 0.999. Following the implementation of the DarkCovidNet model by Ozturk et al. [17], we apply exponential learning rate decay every 1000 steps with a base of 0.7. We apply batch normalization with leaky ReLU activation with α = 0.1. To account for the class imbalance caused by the smaller number of COVID-19 images (125 samples compared to 500 each in the No-Findings and Pneumonia classes), we assign the COVID-19 class four times the weight of the other two classes. The baseline DarkCovidNet model is trained using the categorical cross-entropy loss function, while the OS-constrained model is trained by augmenting this loss with the orthogonality loss. During network training, we use random horizontal flipping and slight vertical and horizontal image translation for data augmentation.
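The loss and optimization settings described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' TensorFlow implementation: the function names are hypothetical, the regularizer is computed for a single flattened feature vector instead of a batch, the additive form of the total loss is assumed, and whether the learning rate decays smoothly or in a staircase fashion is not stated in the text, so both are offered.

```python
import numpy as np


def os_regularizer(features, k):
    """Frobenius norm of Z^T Z - I for one flattened feature vector.

    `features` has length m and is split into k blocks of size d = m / k.
    """
    features = np.asarray(features, dtype=np.float64)
    m = features.size
    if m % k != 0:
        raise ValueError("feature dimension m must be divisible by k")
    d = m // k
    # Columns of Z are the k contiguous feature blocks z_1, ..., z_k.
    Z = features.reshape(k, d).T
    gram = Z.T @ Z  # k x k matrix of inner products between blocks
    return float(np.linalg.norm(gram - np.eye(k), ord="fro"))


def cross_entropy(probs, one_hot, class_weights=None):
    """Categorical cross-entropy for one sample, optionally class-weighted."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), 1e-12, 1.0)
    one_hot = np.asarray(one_hot, dtype=np.float64)
    weight = 1.0 if class_weights is None else float(np.dot(class_weights, one_hot))
    return float(-weight * np.sum(one_hot * np.log(probs)))


def total_loss(probs, one_hot, features, k, lam, class_weights=None):
    """Cross-entropy plus lam times the OS regularizer (assumed additive form)."""
    ce = cross_entropy(probs, one_hot, class_weights)
    return ce + lam * os_regularizer(features, k)


def decayed_learning_rate(step, initial_lr=0.003, decay_steps=1000,
                          decay_rate=0.7, staircase=False):
    """Exponential decay: initial_lr * decay_rate ** (step / decay_steps)."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent


# Class weights in the order (COVID-19, No-Findings, Pneumonia); the ordering
# is an assumption, the 4:1:1 ratio is the one stated in the text.
CLASS_WEIGHTS = np.array([4.0, 1.0, 1.0])
```

For example, `os_regularizer([1, 0, 0, 1], k=2)` is zero because the two blocks are orthonormal, while two identical blocks are penalized through the off-diagonal entries of the Gram matrix.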
When interpreting experimental results, the label "baseline" represents the original DarkCovidNet model trained with only cross-entropy loss; the label "+OS" signifies that the OS constraint is applied to the baseline model, so both the cross-entropy loss and the OS regularizer are applied when training the model.

Figure 5. Confusion matrices for 3-class classification. The first row represents performance of the regular baseline and OS-constrained models. The second row represents results obtained from the models with data augmentation applied.

Figure 6. Confusion matrices for 2-class classification. The first row represents performance of the baseline and OS-constrained models. The second row represents results obtained from the models with data augmentation applied.

To optimize the OS-constrained model, we performed experiments to determine the value of k that results in the highest classification accuracy. As mentioned previously, k is the number of feature blocks into which the representation is partitioned. For two classes, the OS-constrained model (+OS) and the OS-constrained model with data augmentation (+OS+Aug) achieve marginally higher performance for k = 6. On the other hand, Figure 3 reports classification results of the OS-constrained model for three classes. With three classes, we find that the average accuracy is highest for k = 3. These respective k values are used in all other experiments.

As shown in Figure 3, the OS-constrained model performs slightly better than the baseline model for all values of k. Taking the optimal value, i.e., k = 3, Table 2 shows the average accuracy, precision, recall, and F1-score across 5 folds for the OS-constrained and baseline models. The DarkCovidNet model obtains an average classification accuracy of 78.49% and 81.69% without and with data augmentation, respectively. In comparison, the OS-constrained models obtain average accuracies of 83.29% and 83.27% in the same scenarios, an improvement of approximately 4.8 and 1.6 percentage points over the corresponding baseline models.
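As a rough illustration of how the fold-averaged metrics reported above could be derived from per-fold confusion matrices, the sketch below computes accuracy and macro-averaged precision, recall, and F1. The macro averaging, the function names, and the row/column convention are assumptions; the text does not state how the per-class scores were combined.

```python
import numpy as np


def per_class_metrics(confusion):
    """Accuracy plus macro-averaged precision, recall, and F1.

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=np.float64)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)  # column sums: predictions per class
    actual = confusion.sum(axis=1)     # row sums: true samples per class
    # Classes with no predictions or no samples get a score of zero.
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual,
                       out=np.zeros_like(true_pos), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(true_pos), where=denom > 0)
    return {
        "accuracy": float(true_pos.sum() / confusion.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }


def average_over_folds(fold_confusions):
    """Mean of each metric across cross-validation folds."""
    per_fold = [per_class_metrics(c) for c in fold_confusions]
    return {name: float(np.mean([m[name] for m in per_fold]))
            for name in per_fold[0]}
```

Calling `average_over_folds` with five confusion matrices, one per fold, would give a single row of a table like Table 2 under these assumptions.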
Data augmentation improves the accuracy of the baseline model, while the accuracy of the OS-constrained model is essentially unchanged (83.29% versus 83.27%). For a more detailed analysis of the three-class problem, we also computed the confusion matrices shown in Figure 5. As pointed out in [17], the deep learning model is better at classifying COVID-19 than the pneumonia and no-findings classes. The improvements in classification performance of the OS-constrained model can be attributed to the more diverse feature representations, reduced model calibration error, and improved robustness provided by the OS regularizer.

Next, we evaluate the performance of our OS-regularized model on the two-class classification task, involving only the COVID-19 and No-Findings classes. Figure 4 displays the average accuracy obtained from the OS and baseline models for various k values. Again, we find that the classification performance of the OS-constrained model is consistently higher than that of the DarkCovidNet model by a slight margin. For 2-class classification, the optimal k is 6, and Table 3 details specific performance metrics across 5 folds. The average accuracy of the OS-constrained model was 99.04% compared to 97.44% for the baseline model, and 99.68% compared to 97.92% for the models with data augmentation, reflecting a 1-2 percentage point difference. Both OS-constrained models surpassed the 98.08% accuracy reported by Ozturk et al. [17] for the DarkCovidNet model. We have also included in Figure 6 the overlapped confusion matrices obtained over 5 folds, where we find that our OS-constrained model achieved slightly higher performance overall.

We obtained Grad-CAM [18] heat maps to visually depict decisions made by the deep learning model. The heatmap reveals the regions of the X-ray image that contributed most to the model's classification.
The images in Figure 7 represent Grad-CAM visualizations of 6 test images from the chest X-ray dataset, with 2 images per class, obtained from four experimental models for 3-class classification. Similar to the findings of Ozturk et al. [17], the baseline DarkCovidNet model highlights more scattered areas outside the lungs, such as the chest bone, shoulders, and diaphragm, which are generally irrelevant to diagnosis and may hinder post-hoc interpretability. Although applying data augmentation to the baseline model seems to consolidate some regions, overall these areas are not helpful in understanding model decisions. In contrast, the OS regularizer captures more exact and localized areas within the lobes of the lungs, suggesting improved semantic interpretation as regions of interest are better preserved. As with the baseline model, applying data augmentation to the OS-constrained model helped identify more relevant areas in the image. It can be noted that the OS-constrained model seems to focus more on the right side of the lung when classifying COVID-19, but emphasizes both sides of the lung for the No-Findings and Pneumonia classes. We observe that the Grad-CAM heatmaps obtained from the OS-constrained model highlight very specific lung regions, which may help radiologists identify diagnostic features such as ground-glass opacities and consolidation [12].

Figure 11. Grad-CAM visualization for baseline and OS-constrained models with and without data augmentation methods. The OS constraint uses k = 3. For each pair of two columns, the first column displays the visualizations obtained from the original images, and the second column represents visualizations obtained from the horizontally flipped images. (Column labels: Input, Baseline, +Aug, +OS, +OS+Aug; recovered row labels: No-Findings, Pneumonia.)

The λ parameter in Eq. 2 governs the contribution of the OS constraint during network training.
We analyzed the behaviour of network performance for different values of λ in the range [0, 1]. Figure 8 shows the average accuracy obtained for different values of the λ parameter for 3-class classification with k = 2. We observe that the optimal performance of the model is achieved for λ = 0.8. This value of λ is used for all other experiments.

We also evaluate how well models were calibrated using the OS regularizer. Calibration metrics allow us to determine whether the predicted softmax scores obtained from the model are good indicators of the actual probability of the correct predictions. Our models are assessed using the Expected Calibration Error (ECE), Overconfidence Error (OE), and Brier Score (BS) [8, 7]. These calibration metrics are defined as:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

$$\mathrm{OE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left[ \mathrm{conf}(B_m) \cdot \max\left( \mathrm{conf}(B_m) - \mathrm{acc}(B_m),\, 0 \right) \right]$$

$$\mathrm{BS} = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{K} \left( p_{n,j} - y_{n,j} \right)^2$$

Here, $B_m$ represents the set of predictions falling in bin $m$ (so $|B_m|$ is their number), $M$ is the number of bins, $N$ is the total number of predictions, and $K$ represents the number of classes. $\mathrm{acc}(B_m)$ denotes the accuracy of the model within bin $m$ and $\mathrm{conf}(B_m)$ denotes the model's average confidence within that bin; $p_{n,j}$ is the predicted probability of class $j$ for sample $n$, and $y_{n,j}$ is the corresponding one-hot label. Figure 9 shows the calibration metric scores obtained from our baseline and OS-constrained models. Note that models with lower calibration scores are better. We find that lower calibration scores are obtained when we implement the OS regularizer with the baseline model, and that data augmentation slightly reduces calibration scores. These findings are especially significant for the 2-class classification task.

Figure 10 shows the validation and training accuracy and loss curves for the baseline and OS-constrained models. The accuracy curves reveal that the OS-constrained model tends to achieve higher training and validation accuracy than the baseline model throughout the training period. The loss curves for both models are relatively similar, although the validation loss of the OS-constrained model shows slightly more volatility than that of the baseline model.

In this subsection we study the effect of horizontally flipping input images on Grad-CAM visualizations.
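The calibration metrics discussed earlier in this section (ECE, OE, and the Brier Score) can be computed from softmax outputs along the following lines. This sketch assumes the standard binned definitions, with equal-width confidence bins and the maximum softmax probability as the confidence; the number of bins and the function name are hypothetical choices not taken from the text.

```python
import numpy as np


def calibration_metrics(probs, labels, n_bins=10):
    """Return ECE, OE, and Brier score for softmax outputs.

    probs:  array of shape (N, K) with predicted class probabilities.
    labels: array of shape (N,) with integer class labels.
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    n_classes = probs.shape[1]
    confidences = probs.max(axis=1)     # confidence of the predicted class
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(np.float64)

    # Equal-width bins (lo, hi] over the confidence range [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    oe = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()          # |B_m| / N
        acc = correct[in_bin].mean()    # acc(B_m)
        conf = confidences[in_bin].mean()  # conf(B_m)
        ece += weight * abs(acc - conf)
        oe += weight * conf * max(conf - acc, 0.0)

    # Brier score: mean squared distance between probabilities and one-hot labels.
    one_hot = np.eye(n_classes)[labels]
    brier = float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))
    return {"ece": float(ece), "oe": float(oe), "brier": brier}
```

A perfectly confident and correct model scores zero on all three metrics, while a model that is confident but often wrong is penalized most heavily by OE.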
Input images were mirrored across the vertical axis for testing. Using the OS-regularized and baseline models for the 3-class classification task to obtain predictions, we evaluate the Grad-CAM heatmaps resulting from these modified images. Despite flipping the images, the heatmaps shown in Figure 11 stayed relatively consistent with those obtained from our previous experiments for all models, with highlighted regions exhibiting only slight shifts. For example, the regions emphasized by the OS-constrained models with data augmentation remained concentrated on the right side of the lung in the COVID-19 class. Since these highlighted regions were not mirrored after horizontally flipping the input images, these results suggest that, despite the improved performance achieved by the OS regularizer, our model still lacks robustness to transformed data. In future research, other techniques may be explored in conjunction with OS constraints to improve the robustness of deep learning models.

This work was supported in part by NSF RAPID grant 2029044.

In this work, we studied an orthogonality constraint imposed on a deep learning model to classify COVID-19 cases from chest X-ray images. The proposed OS regularization yields improved performance compared to the baseline DarkCovidNet model, obtaining a classification accuracy of 83.29% versus 78.49% for three classes, and 99.04% versus 97.44% for two classes, without augmentation. Our OS-constrained model generates more localized and interpretable activation maps that can assist radiologists in understanding classification decisions and improve acceptance of deep learning models in clinical settings. In future work, it would be promising to explore applications of orthogonality constraints in other medical imaging tasks, such as the diagnosis of chest-related diseases including pneumonia or tuberculosis.
Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases
Relationship to duration of infection
Role of orthogonality constraints in improving properties of deep networks for image classification
COVID-19 image data collection
ImageNet: A large-scale hierarchical image database
Strictly proper scoring rules, prediction, and estimation
On calibration of modern neural networks
and Mohamed Esmail Karar. COVIDX-Net: A framework of deep learning classifiers to diagnose COVID-19 in X-ray images
Essentials for radiologists on COVID-19: An update-radiology scientific expert panel
Learning multiple layers of features from tiny images
Coronavirus disease 2019 (COVID-19): Role of chest CT in diagnosis and management
Deep learning on brain cortical thickness data for disease classification
Deep-COVID: Predicting COVID-19 from chest X-ray images using deep transfer learning
Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
Reading digits in natural images with unsupervised feature learning
Automated detection of COVID-19 cases using deep neural networks with X-ray images
Grad-CAM: Visual explanations from deep networks via gradient-based localization
Deep learning to improve breast cancer detection on screening mammography
Product of orthogonal spheres parameterization for disentangled representation learning
Diagnosing COVID-19: The disease and tools for detection
COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Segmentation-based deep learning fundus image analysis
Chest CT for typical coronavirus disease 2019 (COVID-19) pneumonia: Relationship to negative RT-PCR testing
Empirical evaluation of rectified activations in convolutional network
Coronavirus disease 2019 (COVID-19): A perspective from China