key: cord-0588523-ykxwiblz
authors: Alshazly, Hammam; Linse, Christoph; Barth, Erhardt; Martinetz, Thomas
title: Explainable COVID-19 Detection Using Chest CT Scans and Deep Learning
date: 2020-11-09
journal: nan
DOI: nan
sha: b902bc1381f0fceaa04308dbac9e0b2306ee5eb9
doc_id: 588523
cord_uid: ykxwiblz

This paper explores how well deep learning models trained on chest CT images can diagnose COVID-19-infected people in a fast and automated process. To this end, we adopt advanced deep network architectures and propose a transfer learning strategy using custom-sized inputs tailored to each deep architecture to achieve the best performance. We conduct extensive sets of experiments on two CT image datasets, namely SARS-CoV-2 CT-scan and COVID19-CT. The obtained results show superior performance for our models compared with previous studies: our best models achieve average accuracy, precision, sensitivity, specificity and F1-score of 99.4%, 99.6%, 99.8%, 99.6% and 99.4% on the SARS-CoV-2 dataset, and 92.9%, 91.3%, 93.7%, 92.2% and 92.5% on the COVID19-CT dataset, respectively. Furthermore, we apply two visualization techniques to provide visual explanations for the models' predictions. The visualizations show well-separated clusters for CT images of COVID-19 versus other lung diseases, and accurate localization of the COVID-19-associated regions.

The virus is transmitted by direct and/or indirect contact with infected people through respiratory droplets when they sneeze, cough or even talk [1-3]. The reverse transcription polymerase chain reaction (RT-PCR) test is the standard reference for confirming COVID-19, and with the rapid increase in the number of infected people, many countries are facing shortages of testing kits. Moreover, RT-PCR testing has high turnaround times and a high false-negative rate [4]. Thus, it is highly desirable to have other testing tools that identify COVID-19-infected patients so they can be isolated, mitigating the pandemic's impact on the lives of many people. Chest computed tomography (CT) is a viable supplement to RT-PCR testing and has been playing a role in screening and diagnosing COVID-19 infections. In recent studies [5, 6], the authors manually examined chest CT scans of more than a thousand patients and confirmed the usefulness of chest CT for diagnosing COVID-19 with high sensitivity. In some cases, patients initially had a negative RT-PCR test, yet the diagnosis was confirmed by their positive CT findings. Moreover, chest CT screening has been recommended when patients show symptoms compatible with viral infections but their RT-PCR test is negative [5, 7]. Nevertheless, diagnosing COVID-19 from chest CT images takes radiologists time, and manually checking every CT image might not be feasible in emergency cases. Therefore, there is a need for automated detection tools that exploit recent deep learning techniques and CT images to expedite the process and provide consistent performance.

This paper adopts the most advanced deep Convolutional Neural Network (CNN) architectures, which are top performers in the ImageNet recognition challenge [8], and presents a comprehensive study of detecting COVID-19 from CT images. We explore CNN models with different architectural designs and varying depths to obtain the best detection performance. Even though we conduct our experiments on two of the largest CT scan datasets available for research, their size is still insufficient to train deep networks from scratch.
An effective strategy to overcome this limitation is transfer learning [9], where deep networks trained on visual tasks are used to initialize networks for different but related target tasks. Most published works that applied transfer learning with ImageNet [10] pretrained networks adhered to the fixed input size required by each deep network and resized their target images accordingly. We argue that resizing images with different aspect ratios to match a specific resolution can distort the image severely. We address this problem by placing the images into a fixed-sized canvas determined specifically for each CNN architecture, preserving the aspect ratio of the original image. This less disruptive procedure has proven more effective at achieving better results, as reported in [11]. Moreover, we utilize the layer-wise adaptive large-batch optimization technique LAMB [12], which has demonstrated better performance and convergence speed for training deep networks. The performance of the models is measured quantitatively using accuracy, precision, sensitivity, specificity, F1-score and the confusion matrix for each model. The obtained results indicate the effectiveness of our strategy, achieving state-of-the-art results on the considered datasets.

To provide better explainability of the deep models and make them more transparent, we apply two visualization techniques. The first is t-distributed Stochastic Neighbor Embedding (t-SNE) [13], a dimensionality reduction and visualization technique for visualizing clusters of instances in a high-dimensional space. The obtained t-SNE embeddings show well-separated clusters representing CT images of COVID-19 and Non-COVID-19 cases. The second is Gradient-weighted Class Activation Mapping (Grad-CAM) [14], a visualization technique for CNN-based models. It provides high-resolution, class-discriminative visualizations that localize the image regions most important for the model's prediction. The Grad-CAM visualizations show how accurately our models localize the COVID-19-associated regions.

Overall, this paper makes the following contributions:

- A comparative experimental study is conducted on how well advanced deep CNNs trained on chest CT images can identify COVID-19 cases. To this end, we experiment with 12 deep networks that have different architectural designs and varying depths, and provide quantitative and qualitative analyses.
- We propose a domain adaptation strategy to fine-tune deep networks using custom-sized inputs determined specifically for each architecture, and utilize the LAMB optimizer for training the networks. Our experimental results prove the effectiveness of our optimization configuration, achieving state-of-the-art performance on the considered CT image datasets. Our best models achieve average accuracies of 99.4% and 92.9%, and average sensitivities of 99.8% and 93.7%, on the largest CT image datasets available for research.
- We provide visualizations of the features extracted by different models to understand how deep networks represent CT images in feature space. The visualizations show well-separated clusters for the CT images of the different classes, indicating that our models have learned discriminative features to distinguish the different cases.
- We show discriminative localizations and visual explanations obtained by our models for detecting COVID-19-associated regions in CT images, as annotated by expert radiologists.

The rest of the paper is structured as follows. We review the related work in the next section. The deep CNN architectures are described in Section 3 and the methodology to learn discriminative features in Section 4. The experimental settings and the obtained results are reported in Section 5. Finally, we draw the main conclusion in Section 6.

This section highlights relevant work that adopted deep CNNs for building computer-aided diagnosis (CAD) systems based on medical images. The authors in [15] employed different deep CNN architectures, pretrained on the ImageNet dataset [10], and fine-tuned them on specific CT scans for thoraco-abdominal lymph node detection and interstitial lung disease classification. Their study indicated the effectiveness of deep CNNs for CAD problems even when training data is limited. In [16], the authors proposed the CheXNet model to detect different types of pneumonia from chest X-ray images. The model consists of 121 layers and was trained on a large dataset containing over 100,000 X-ray images covering 14 different thoracic diseases. The model showed outstanding detection performance at the level of practicing radiologists.

In the context of the COVID-19 pandemic, extensive research has been conducted to develop automated image-based COVID-19 detection and diagnosis systems [17-21]. We hereafter review the proposed approaches for reliable detection systems based on chest X-ray and CT-scan imaging modalities. These techniques follow one of two main paradigms. On the one hand, new deep network architectures have been developed and tailored specifically for detecting and recognizing COVID-19. COVID-Net [22] represents one of the earliest convolutional networks designed to detect COVID-19 cases automatically from X-ray images. The network showed an acceptable accuracy of 83.5% and a high sensitivity of 100% for COVID-19 cases. Hasan et al. [23] proposed a CNN-based network named Coronavirus Recognition Network (CVR-Net) to automatically detect COVID-19 cases from radiography images. The network was trained and evaluated on datasets with X-ray and CT images. The obtained results showed varying accuracy scores depending on the number of classes in the underlying X-ray image dataset, and an average accuracy of 78% on the CT image dataset. Further modifications were applied to COVID-Net to improve its representational ability for one specific image modality and to make the network computationally more efficient, as in [24]. On the other hand, some deep networks proposed for similar tasks of automated detection and recognition of COVID-19 cases are based on well-designed, existing CNN architectures, such as ResNet [25], Xception [26] and Capsule Networks [27, 28]. The authors in [29] adopted transfer learning from deep networks for automatic COVID-19 detection based on X-ray images from patients with bacterial pneumonia, patients with COVID-19 pneumonia, and normal cases. They reported the best results for the two- and three-class classification tasks with accuracies of 98.75% and 93.48%, respectively. Minaee et al. [30] applied transfer learning by fine-tuning four popular pretrained CNNs to identify COVID-19 infection. They experimented on a prepared dataset of 5,000 chest X-rays.
Their best approaches obtained an average sensitivity and specificity of 98% and 90%, respectively. Brunese et al. [31] utilized transfer learning with a pretrained VGG-16 network [32] to automatically detect COVID-19 from chest X-rays. On a dataset combined from different sources, containing X-rays of healthy subjects and patients with pulmonary disease, they reported an average accuracy of 97%. Zhou et al. [33] highlighted the importance of deep learning techniques and chest CT images for differentiating COVID-19 pneumonia from influenza pneumonia. The study was conducted on CT images of confirmed COVID-19 patients from different hospitals in China. It demonstrated the potential of accurate COVID-19 diagnosis from CT images and the effectiveness of the proposed classification scheme to differentiate between the two types of pneumonia. DeepPneumonia [34] was developed to identify COVID-19 cases (88 patients), bacterial pneumonia (100 patients) and healthy cases (86 subjects) based on CT images. The model achieved an accuracy of 86.5% for differentiating bacterial and viral (COVID-19) pneumonia, and an accuracy of 94% for distinguishing COVID-19 from healthy cases. The authors in [35] used CT images to classify COVID-19-infected patients versus Non-COVID-19 people utilizing a pretrained DenseNet201 network. The model achieved an accuracy of 96.25%.

Very few studies employed handcrafted feature extraction methods and conventional classifiers. In [36], texture features were extracted from X-ray images using popular texture descriptors. The features were combined with those extracted from a pretrained Inception V3 [37] using different fusion strategies. Then, various classifiers were used to differentiate between normal X-rays and different types of pneumonia. The best classification scheme achieved an F1-score of 83%. In [38], the authors proposed an approach to differentiate between positive and negative COVID-19 cases based on CT scans. Texture features were extracted from CT images with Gabor filters, and support vector machines were then trained for classification. Their scheme achieved an average accuracy of 95.37% and a sensitivity of 95.99%.

This discussion of related work indicates the prominence of deep learning methods for the task of automated COVID-19 detection. We build on the existing body of published work and adopt advanced deep networks for detecting COVID-19 using CT images. We conduct experiments on two of the largest CT image datasets and compare the performance of 12 deep networks using standard evaluation metrics. We also provide visualizations for better explainability of the resulting models.

This section describes the deep CNN architectures employed to identify COVID-19 using chest CT scans. These networks are state-of-the-art deep models for image recognition. They differ in their architectural design and were proposed either to achieve better representational power or to reduce computational complexity. In this work we consider the most advanced networks, namely SqueezeNet [39], Inception [37], ResNet [40], ResNeXt [41], Xception [42], ShuffleNet [43] and DenseNet [44].

SqueezeNet is a deep CNN proposed for computer vision tasks with a primary focus on efficiency: fewer parameters and a smaller model size [39]. Its basic building block is the fire module depicted in Figure 1. The module incorporates a squeeze phase and an expand phase.
The squeeze phase applies a set of 1 × 1 filters followed by a ReLU activation. The number of learned squeeze filters is always smaller than the size of the input volume. Consequently, the squeeze phase can be considered a dimensionality reduction step that, at the same time, captures the pixel correlations across the input channels. The output of the squeeze phase is fed into the expand phase, in which a combination of 1 × 1 and 3 × 3 convolutions is learned. The larger 3 × 3 filters capture the spatial correlation among pixels. The outputs of the expand phase are concatenated across the channel dimension and then passed through a ReLU activation. The original paper proposes using n 1 × 1 filters and n 3 × 3 filters in the expand phase, where n is four times the number of filters used in the squeeze phase. The entire SqueezeNet architecture is constructed by stacking conventional convolution layers, max-pooling layers and fire modules, and ends with an average pooling layer. The model has no fully connected layers. For more details about the number of fire modules at each stage, their order, and the number of squeeze and expand filters, see [39].

The Inception network is a deep convolutional architecture introduced as GoogLeNet (Inception V1) in 2014 by Szegedy et al. [45]. The architecture has been refined in various ways, such as adding batch normalization layers to accelerate training (Inception V2 [46]) and factorizing convolutions with larger spatial filters for computational efficiency (Inception V3 [37]). We adopt the Inception V3 model due to its outstanding performance in image recognition and object localization. The fundamental building block of all Inception-style networks is the Inception module, of which several variants exist. Figure 2 shows one variant that is used in the Inception V3 model. The module accepts an input and branches into four paths, each performing a specific set of operations. The input passes through convolutional layers with different kernel sizes (1 × 1 and 3 × 3) as well as a pooling operation. Applying different kernel sizes allows the module to capture complex patterns at different scales. The outputs of all branches are concatenated channel-wise. The overall Inception V3 network is composed of conventional 3 × 3 convolutional layers at the early stages, some of them followed by max-pooling operations. Subsequently, a stack of various Inception modules is applied. These modules differ in the number of applied filters, the filter sizes, the depth of the module after symmetric or asymmetric factorization of larger convolutions, and where the filter bank outputs are expanded. The last Inception module is followed by an average-pooling operation and a fully connected layer.

Deep Residual Networks (ResNets), proposed by He et al. [40], represent a family of extremely deep CNN architectures that won the 2015 Large Scale Visual Recognition Challenge (ILSVRC-2015) for image recognition, object detection and localization [8]. The winning network is composed of 152 layers, which confirms the beneficial impact of network depth on visual representations. However, two major problems are encountered when training networks of increasing depth: vanishing gradients and performance degradation. The authors addressed these problems by adding skip connections to prevent information loss as the network gets deeper.
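To illustrate this idea, the following is a minimal PyTorch sketch of a residual block with an identity skip connection (the variant used in ResNet18); the fixed channel count and stride handling are our simplifying assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions plus an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x  # the skip connection carries the input forward
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # adding the input lets gradients bypass the convolutions,
        # counteracting vanishing gradients in very deep networks
        return self.relu(out + identity)
```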
The cornerstone of deep residual networks is the residual module, of which two variants are depicted in Figure 3. The left path of the residual module in Figure 3(a) is composed of two convolutional layers, which apply 3 × 3 kernels and preserve the spatial dimensions; batch normalization and ReLU activation are also applied. The right path is the skip connection, where the input is added to the output of the left path. This variant is used in the ResNet18 model. Another variant, the bottleneck residual module, is depicted in Figure 3(b); the input signal again passes through two branches. Here, the left path performs a series of convolutions with 1 × 1 and 3 × 3 kernels, along with batch normalization and ReLU activation. The right path is the skip connection, which connects the module's input to an addition operation with the output of the left path. This variant is used in the ResNet50 and ResNet101 models. A deep residual network is constructed by stacking multiple residual modules along with other conventional convolution and pooling layers. For our experiments we adopt three variants of ResNet: the ResNet18, ResNet50 and ResNet101 models. The full configuration and overall structure of each model are given in [40].

The ResNeXt architecture proposed in [41] is a deep CNN constructed by stacking residual building blocks of identical topology in a highly modularized fashion. Its simple design shares similarities with the ResNet architecture. ResNeXt also exploits the split-transform-merge strategy of the Inception module in an easy and extendable manner. The ResNeXt building block uses an identical set of transformations in every branch and hence allows the number of branches to be investigated as an independent hyperparameter. ResNeXt refers to the size of the set of transformations as the cardinality, an important dimension for improving the network's representational power. Figure 4 depicts the ResNeXt building block [41]. The entire network is constructed by stacking ResNeXt blocks along with other conventional convolution and pooling layers. For our experiments we implement two ResNeXt models, the 50-layer and the 101-layer networks. Like their ResNet counterparts, ResNeXt models use RGB inputs of size 224 × 224. However, we found that an input size of 349 × 253, as for the ResNet models, achieves the best performance on the considered datasets.

Xception is a deep CNN architecture proposed in [42]. It is inspired by the Inception architecture and utilizes the residual connections proposed in the ResNet models [40]. However, it replaces the Inception modules with depthwise separable convolution layers. A depthwise separable convolution consists of a depthwise convolution (a spatial convolution of size 3 × 3, 5 × 5, etc.) performed over each channel of the input to map the spatial correlations, followed by a pointwise (1 × 1) convolution to map the cross-channel correlations. The Xception architecture relies entirely on depthwise separable convolution layers, under the strong assumption that spatial correlations and cross-channel correlations can be mapped separately. The network consists of 36 convolutional layers structured into 14 modules; all modules have residual connections except for the first and last. The reader is referred to [42] for a complete description of the model specification. Due to its superior performance in vision tasks, we adopt the Xception model in our experiments.
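A minimal PyTorch sketch of the depthwise separable convolution just described; the channel counts, kernel size, and omission of intermediate activations are illustrative assumptions, not the Xception specification:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise (per-channel spatial) convolution followed by a
    pointwise 1x1 convolution mapping cross-channel correlations."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_channels gives one spatial filter per input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # the 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

Factorizing the convolution this way uses far fewer parameters than a full convolution over all channels, which is the source of Xception's efficiency.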
Even though the original Xception model uses an RGB input of size 299 × 299, we found that an input size of 327 × 231 obtains the best results.

ShuffleNet is a highly computationally efficient CNN architecture designed mainly for mobile devices with constrained computational power [43, 47]. The architecture introduces two important operations to significantly reduce the computational cost while maintaining accuracy. The first is the pointwise group convolution, which reduces the computational complexity of the 1 × 1 convolutions. The second is channel shuffling, which assists the information flow across feature channels. The cornerstone of the ShuffleNet model is the ShuffleNet unit depicted in Figure 5. It is a bottleneck residual module in which the 3 × 3 convolutional layer is replaced by a 3 × 3 depthwise separable convolution, as in [42]. Also, the first 1 × 1 convolutional layer is replaced by a pointwise group convolution followed by a channel shuffle operation. The second pointwise group convolutional layer restores the channel dimension to match the left path of the unit. The overall ShuffleNet network is composed of a stack of these units grouped into three stages, along with other conventional convolution and pooling layers. In this study we adopt the recent variant of the ShuffleNet architecture. The original model uses an RGB input of size 224 × 224; however, we found that an input resolution of 321 × 225 works better for the considered datasets.

Densely Connected Convolutional Networks (DenseNets) are a class of CNN architectures introduced in [44] with several compelling characteristics: they alleviate the vanishing gradients problem, foster feature reuse, achieve high performance, consolidate feature propagation, and are computationally efficient. DenseNets modify the shortcut connections of ResNet by concatenating the outputs of the convolutions instead of summing them. Thus, the input to a layer consists of the feature maps of all preceding layers. Figure 6 shows a 3-layer Dense block in which each layer performs a set of batch normalization (BN), ReLU activation and 3 × 3 convolution operations. The preceding feature maps are concatenated and presented as the input to a layer, which then generates k feature maps; k is a newly introduced hyperparameter, called the growth rate. Thus, if the input x0 to the block has k0 feature maps, the number of feature maps at the end of a 3-layer Dense block is 3 × k + k0. To prevent the number of feature maps from increasing too rapidly, DenseNet introduces a bottleneck layer with a 1 × 1 convolution and 4 × k filters. To handle the change in feature map size when transitioning from a large feature map to a smaller one, DenseNet applies a transition layer made of a 1 × 1 convolution and average pooling. A deep DenseNet is constructed by stacking multiple Dense blocks with transition layers; conventional convolution and pooling layers are used at the beginning of the network. Finally, the output is pooled by global average pooling, flattened, and passed to a softmax classifier. For our study we experiment with three variants of DenseNet: the 121-layer, 169-layer and 201-layer architectures. The original models use an RGB input of size 224 × 224; however, we found that an input size of 349 × 253 achieves better results for images from the used datasets. Table 1 summarizes the important characteristics of the adopted deep CNN models.
This includes the square-sized input for each network, our proposed custom-sized input, the number of trainable parameters in millions, the number of layers, and the model size in megabytes.

Transfer learning is an effective representation learning approach in which networks trained on abundant amounts of images (millions) are used to initialize networks for tasks where data is scarce (a few hundred or thousand images). In the context of deep learning there are two common strategies for applying transfer learning from pretrained networks: feature extraction and fine-tuning [48, 49]. In the first strategy, only the weights of some newly added layers are optimized during training; in the second, all weights are optimized for the new task. Here, we use fine-tuning, the more effective strategy, which outperforms feature extraction. As our pretrained networks explicitly require an RGB input, we assign identical values to the R, G and B channels. Since the CT images in the two datasets have varying spatial sizes, the images need to be scaled to match the target input size. One strategy to unify images with different aspect ratios involves stretching or excessive cropping. We opt for a different, less disruptive procedure and embed each image into a fixed-sized canvas: the aspect ratio of the original image is not altered, and padding is applied to match the target shape (a minimal sketch of this step is given at the end of this passage).

This section presents our experimental setup and extensive experiments to show the efficacy of our fine-tuned networks. First, we describe the CT image datasets. Second, we state the experimental settings and performance evaluation metrics. Third, we discuss the results obtained by the different models on each dataset. Finally, we apply two visualization methods to facilitate interpretation of the results and to localize the COVID-19-associated regions.

SARS-CoV-2 CT-scan dataset [50]: The dataset was collected from hospitals in Sao Paulo, Brazil, and contains a total of 2482 CT scans acquired from 120 patients of both genders. It is composed of 1252 scans of patients infected with SARS-CoV-2 and 1230 scans of patients with other lung diseases. The CT scans have varying spatial sizes between 119 × 104 and 416 × 512, and are available in PNG format. CT scans from this dataset are shown in Figure 7.

COVID19-CT dataset [51]: The dataset consists of a total of 746 CT images: 349 CT images of patients with COVID-19 and 397 CT images showing Non-COVID-19 cases with other pulmonary diseases. The positive CT images were collected from preprints about COVID-19 on medRxiv and bioRxiv, and they feature various manifestations of COVID-19. Since the CT images were taken from different sources, they have varying sizes between 124 × 153 and 1485 × 1853. Figure 8 shows example CT images from the COVID19-CT dataset.

To assess the performance of our models we perform five-fold cross-validation. The final performance of a model is computed by averaging the values obtained by the five networks on their respective test folds. Data augmentation is used to effectively increase the amount of training samples for improved generalization. Affine transformations like rotation and shearing turned out to worsen performance, so we excluded these augmentations. The augmentation steps we use include cropping, blurring with a probability of 25%, adding a random amount of Gaussian noise, changing brightness and contrast, and random horizontal flipping.
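As promised above, here is a minimal sketch of the canvas-embedding step, assuming Pillow for image handling; the black padding color and the centering of the image are our illustrative choices, not details specified by the paper:

```python
from PIL import Image

def embed_in_canvas(image: Image.Image, target_w: int, target_h: int) -> Image.Image:
    """Fit an image into a fixed-sized canvas without changing its aspect ratio."""
    scale = min(target_w / image.width, target_h / image.height)
    new_size = (int(image.width * scale), int(image.height * scale))
    resized = image.resize(new_size, Image.BILINEAR)
    canvas = Image.new(image.mode, (target_w, target_h))  # black padding (assumption)
    # centre the resized image; the remaining border acts as padding
    canvas.paste(resized, ((target_w - new_size[0]) // 2,
                           (target_h - new_size[1]) // 2))
    return canvas

# e.g. using the custom input size reported for the ResNet models (349 x 253)
padded = embed_in_canvas(Image.open("ct_scan.png").convert("RGB"), 349, 253)
```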
Finally, the images are normalized according to the ImageNet statistics. We use the same optimization configuration for all deep networks. The networks are optimized by applying the LAMB optimizer [12] to a binary cross-entropy loss. The initial learning rate is set to 0.0003 and is scheduled to decrease in steps, the first at epoch 50.

Here, we present and discuss the obtained results for detecting COVID-19 on the considered CT image datasets with the different deep networks. We report the quantitative results along with the confusion matrices for each of the adopted architectures. Table 2 summarizes the average values of the evaluation metrics achieved by the different deep networks on the two CT image datasets. All values are given in percentages and the best results are written in bold. We also compare with previously published results where applicable. Generally, we observe some performance differences between the results on the SARS-CoV-2 CT and the COVID19-CT datasets. We also observe the superiority of our models compared with similar models from recently published works, which indicates the effectiveness of our optimization and learning strategy.

On the SARS-CoV-2 CT dataset, ResNet101 achieves the best overall performance with respect to almost all evaluation metrics, with an average accuracy and F1-score of 99.4% each. The model also achieves an average sensitivity of 99.1%, indicating that, on average, only two COVID-19 images are falsely predicted as negatives. It also correctly identifies the Non-COVID-19 cases with only one false positive on average, resulting in a specificity of 99.6%. The highest sensitivity of 99.8% is achieved by the Inception V3 model, where on average only one COVID-19 image is falsely predicted as negative. The SqueezeNet model obtains the lowest performance with respect to all evaluation metrics, with still fairly acceptable average accuracy and sensitivity of 95.1% and 96.2%, respectively. The ShuffleNet architecture also obtains satisfactory performance, with improvements of approximately 2% on average across all metrics compared with SqueezeNet. Although the results obtained by these two models are inferior to those of the other models, they are more efficient; this matches their main objective of reducing computational cost rather than improving visual recognition ability. The remaining models achieve competitive performance and very promising results with only slight differences. Comparing the different variants of ResNet and DenseNet, we see that the deeper variants of each architecture yield slightly better performance. The deeper ResNet101 and ResNeXt101 show a marginal gain over their shallower counterparts. The class-wise results for each model are summarized in the confusion matrices in Figure 9. It is worth mentioning that on the SARS-CoV-2 CT dataset the inter-fold variations are minimal and usually below one percent, showing the robustness of our fine-tuning strategy. For some architectures, like the DenseNet variants, we observe a confidence interval larger than the actual differences in recognition performance. This means that the DenseNets and the deeper ResNet variants share a very similar performance and are almost indistinguishable from each other.
Overall, the results obtained by our models are better than recently published ones, even when the same network architectures are used. We attribute this to the better optimization and transferability of the learned features under our fine-tuning strategy.

On the COVID19-CT dataset, the overall performance with respect to all evaluation metrics is inferior to that on the SARS-CoV-2 dataset. Previously published results on this dataset (excerpt from Table 2) include:

Method | Accuracy | Precision | Sensitivity | Specificity | F1-score
Multi-source transfer learning [55] | 85.9 ± 5.9 | - | 84.9 ± 8.4 | 86.8 ± 6.3 | -
DenseNet169 [56] | 87.7 ± 4.7 | 90.2 ± 6.0 | 85.6 ± 6.7 | - | 87.8 ± 5.0
Contrastive learning [24] | 78.6 ± 1.5 | 78.0 ± 1.3 | 79.7 ± 1.4 | - | 78.8 ± 1.4

The lower performance can be attributed to the cross-source heterogeneity of the CT images in the dataset. The Non-COVID-19 CT images were taken from different sources and show diverse findings, which makes it difficult to distinguish COVID-19 from other findings associated with lung diseases due to potentially overlapping visual manifestations (see Figure 8). Another reason is that the CT images in the COVID19-CT dataset show strong variations in contrast, variable spatial resolution and other visual characteristics, which can affect the model's ability to extract discriminative and generalizable features. It is also worth mentioning that for the COVID19-CT dataset the inter-fold variations grow substantially due to the small size of the dataset. During five-fold cross-validation the training set consists of only about 600 images and the test fold has fewer than 200 images, which inevitably produces statistical fluctuations. Metrics capturing the overall performance, like accuracy, show less inter-fold variation. However, we observe stronger variations in metrics that measure the bias towards one of the classes, like specificity. The standard deviation of the specificity indicates that the different folds tend to encourage the model to focus more on COVID or more on Non-COVID cases. This phenomenon occurs even with stratified five-fold cross-validation, where the class distribution in each fold matches that of the entire dataset, and it seems to originate solely from the small number of images.

Our models achieve fairly good performance compared with recently published work using the exact same network architectures. This can be attributed to the better optimization of our models and the effectiveness of our fine-tuning strategy. Several models have nearly identical results for all evaluation metrics. We observe that small networks such as ResNet18 achieve results comparable to deeper models. The SqueezeNet and ShuffleNet models perform at a similar level of accuracy. The ResNeXt variants have comparable results and perform as well as the different ResNet variants. A detailed analysis of the class-wise results for individual models is presented in the confusion matrices in Figure 10.

This subsection provides visual explanations to make our models more transparent. We start with a 2D projection of the learned features using t-SNE [13], and then present the localization maps highlighting the COVID-19-associated regions using Grad-CAM [14]. To understand how the deep neural networks represent the CT images in the high-dimensional feature space, we apply the t-SNE algorithm to visualize the learned features. For each image in the SARS-CoV-2 dataset we first extract the 2048-dimensional feature vector from the penultimate layer of the Inception V3 model. Next, we apply t-SNE to map the features onto a 2D space and visualize the embeddings of the training and test representations.
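A minimal sketch of this projection step, assuming the feature vectors and binary labels have already been extracted into NumPy arrays; the perplexity, initialization, and plotting details are our assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray) -> None:
    """Project (n_images, 2048) feature vectors to 2D and plot them by class."""
    embedded = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(features)
    for cls, name in [(1, "COVID-19"), (0, "Non-COVID-19")]:
        mask = labels == cls
        plt.scatter(embedded[mask, 0], embedded[mask, 1], s=8, label=name)
    plt.legend()
    plt.show()
```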
Figure 11 clearly shows two well-separated clusters for the CT images of COVID-19 and Non-COVID-19 cases. The distributions of training and test features are quite similar, which indicates good generalization capabilities of our model. The clear, wide margin between the two classes shows how well the CT images are separated in feature space. We repeat the same procedure for the COVID19-CT dataset, extracting the 1664-dimensional feature vectors from the penultimate layer of the DenseNet169 model and again applying t-SNE to map the features onto a 2D space. Figure 12 shows two clusters representing the CT images of the COVID-19 and Non-COVID-19 classes. Even though the classes are fairly distinguishable with a clear decision boundary, some CT images are misclassified, most notably Non-COVID-19 CT images from the test set.

To make our models more transparent and provide detailed visual analysis, we present the Grad-CAM localization maps obtained by different models. We consider CT images with COVID-19 abnormalities from the test set of each dataset and highlight the important regions considered for the prediction. For the SARS-CoV-2 dataset we use the Inception V3 model. Figure 13 shows the original CT images and their localization maps. Our model is capable of detecting the regions that show abnormalities in the CT scans. In a similar way, we classify the test CT scans from the COVID19-CT dataset with the DenseNet169 model and highlight the important regions considered for the predictions. We present the original CT images and their localization maps in Figure 13. Again, our model is capable of detecting the COVID-19-related regions as marked (small squares in some images) by expert radiologists.

A wide variety of typical and atypical CT abnormalities have been reported for COVID-19 patients in various studies [58, 59]. We therefore tested our models on external CT images extracted from these two publications, as they feature typical findings of COVID-19 pneumonia marked by specialists. To make sure that none of the extracted images is unintentionally included in our training data, specifically via the COVID19-CT dataset, we use the model trained on the SARS-CoV-2 dataset. First, the Inception V3 model is employed to classify the extracted CT images; it correctly classifies all of them as COVID-19. Second, to interpret the model's generalization capabilities, we apply the Grad-CAM technique to visualize the regions of abnormality that the model considers. Assessing the different CT images in Figure 15, we see that the model accurately localizes the disease-related regions. Even more interesting is that the model ignores extraneous marks in the images, such as letters, and localizes only the COVID-19-related regions. These visual explanations show that our models learn relevant, generic visual features related to COVID-19 and can correctly classify CT images from outside the datasets on which they were trained. Figure 16 shows various CT scans where only one lung is visible. These CT scans are also extracted from [58] and show different CT manifestations of COVID-19 pneumonia marked by red squares. The Inception V3 model is capable of classifying them correctly as COVID-19, although it was trained on CT scans where the entire lung is visible.
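Grad-CAM maps like those used throughout this analysis can be computed along the following lines; this is a compact PyTorch sketch assuming a standard classification head, with the hook handling and normalization details being our own choices rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Compute a Grad-CAM heatmap for one image tensor of shape (1, 3, H, W).
    target_layer is typically the last convolutional layer of the network."""
    activations, gradients = {}, {}
    fwd = target_layer.register_forward_hook(
        lambda module, inputs, output: activations.update(a=output))
    bwd = target_layer.register_full_backward_hook(
        lambda module, grad_in, grad_out: gradients.update(g=grad_out[0]))
    logits = model(image)
    idx = logits.argmax(dim=1) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, idx].backward()  # gradients of the chosen class score
    fwd.remove(); bwd.remove()
    # per-channel weights: global average pooling of the gradients
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))
    # upsample to the input resolution and normalize to [0, 1]
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()
```

The resulting heatmap can be overlaid on the original CT image to highlight the regions that drove the prediction.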
Intriguingly, when applying Grad-CAM we can see that all regions of abnormality are accurately localized. This further demonstrates the potential of our model to detect COVID-19 abnormalities in CT images outside the dataset used for training.

We proposed different deep learning-based approaches for accurate COVID-19 detection using chest CT images. The most advanced deep network architectures and their variants were considered, and extensive experiments were conducted on the two datasets with the largest number of CT images available so far. Moreover, we investigated different configurations and determined a custom-sized input for each network to achieve the best detection performance. The resulting networks showed significantly improved performance for detecting COVID-19. Our models achieved state-of-the-art performance with average accuracies of 99.4% and 92.9%, and sensitivities of 99.8% and 93.7%, on the SARS-CoV-2 CT and COVID19-CT datasets, respectively. This indicates the effectiveness of our proposed approaches and the potential of deep learning for fully automated and fast diagnosis of COVID-19. To explain the obtained results we employed two visualization methods. First, we explored the learned features using the t-SNE algorithm; the resulting visualizations showed well-separated clusters for COVID-19 and Non-COVID-19 cases. We also assessed the obtained networks using the Grad-CAM algorithm to obtain high-resolution visualizations showing the discriminative regions of abnormality in the CT images. Moreover, we tested our models on external CT images from different publications. Our models were capable of detecting all COVID-19 cases and accurately localizing the COVID-19-associated regions as marked by expert radiologists.

Fig. 11: Visualization of the t-SNE embeddings for the entire SARS-CoV-2 CT dataset. We clearly see two different clusters representing the COVID-19 (red for train and blue for test samples) and Non-COVID-19 (yellow for train and green for test samples) classes.

Fig. 12: Visualization of the t-SNE embeddings for the entire COVID-19 CT dataset. As in Figure 11, we can see two different clusters representing the COVID-19 and Non-COVID-19 classes.

Fig. 15: Grad-CAM visualizations for CT images taken from two publications [58, 59]. The CT images were correctly classified as COVID-19 and the disease-related regions are accurately localized, as marked by specialists.

Fig. 16: Grad-CAM visualizations for CT images taken from [58]. The CT scans show different manifestations of COVID-19 marked by red frames or white arrows. Our model was able to identify them as COVID-19 and accurately localize the COVID-19-associated abnormalities.

References

[1] Community transmission of severe acute respiratory syndrome coronavirus 2, Shenzhen, China, 2020
[2] First known person-to-person transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the USA
[3] Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: A descriptive study
[4] Diagnosis of the Coronavirus disease (COVID-19): rRT-PCR or CT?
[5] Sensitivity of chest CT for COVID-19: Comparison to RT-PCR
[6] Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases
[7] Chest CT findings in 2019 novel coronavirus (2019-nCoV) infections from Wuhan, China: Key points for the radiologist
[8] ImageNet large scale visual recognition challenge
[9] A survey of transfer learning
[10] ImageNet: A large-scale hierarchical image database
[11] Deep convolutional neural networks for unconstrained ear recognition
[12] Large batch optimization for deep learning: Training BERT in 76 minutes
[13] Visualizing data using t-SNE
[14] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[15] Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning
[16] CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning
[17] Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: Evaluation of the diagnostic accuracy
[18] A deep learning system to screen novel coronavirus disease 2019 pneumonia
[19] Deep learning-based detection for COVID-19 from chest CT using weak label
[20] A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19)
[21] Lung infection quantification of COVID-19 in CT images with deep learning
[22] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[23] CVR-Net: A deep convolutional neural network for coronavirus recognition from chest radiography images
[24] Contrastive cross-site learning with redesigned net for COVID-19 CT classification
[25] COVID-ResNet: A deep learning framework for screening of COVID-19 from radiographs
[26] CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images
[27] COVID-CAPS: A capsule network-based framework for identification of COVID-19 cases from X-ray images
[28] Convolutional CapsNet: A novel artificial neural network approach to detect COVID-19 disease from X-ray images using capsule networks
[29] COVID-19: Automatic detection from X-ray images utilizing transfer learning with convolutional neural networks
[30] Deep-COVID: Predicting COVID-19 from chest X-ray images using deep transfer learning
[31] Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays
[32] Very deep convolutional networks for large-scale image recognition
[33] Improved deep learning model for differentiating novel coronavirus pneumonia and influenza pneumonia
[34] Deep learning enables accurate diagnosis of novel coronavirus (COVID-19) with CT images
[35] Classification of the COVID-19 infected patients using DenseNet201 based deep transfer learning
[36] COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios
[37] Rethinking the Inception architecture for computer vision
[38] Machine learning analysis of chest CT scan images as a complementary digital test of coronavirus (COVID-19) patients, medRxiv
[39] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size
[40] Deep residual learning for image recognition
[41] Aggregated residual transformations for deep neural networks
[42] Xception: Deep learning with depthwise separable convolutions
[43] ShuffleNet V2: Practical guidelines for efficient CNN architecture design
[44] Densely connected convolutional networks
[45] Going deeper with convolutions
[46] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[47] ShuffleNet: An extremely efficient convolutional neural network for mobile devices
[48] What makes ImageNet good for transfer learning?
[49] Ensembles of deep learning models and transfer learning for ear recognition
[50] SARS-CoV-2 CT-scan dataset: A large dataset of real patients' CT scans for SARS-CoV-2 identification
[51] Sample-efficient deep learning for COVID-19 diagnosis based on CT scans
[52] A deep learning and Grad-CAM based color visualization approach for fast detection of COVID-19 cases using chest X-ray and CT-scan images
[53] COVID CT-Net: Predicting COVID-19 from chest CT images using attentional convolutional network
[54] Identifying COVID-19 from chest CT images: A deep convolutional neural networks based approach
[55] Classification of COVID-19 in CT scans using multi-source transfer learning
[56] COVID-19 detection from radiographs: Is deep learning able to handle the crisis?
[58] Chest CT manifestations of new coronavirus disease 2019 (COVID-19): A pictorial review
[59] COVID-19 pneumonia: A review of typical CT findings and differential diagnosis